hub Baseline reference

Step1X-Edit: A Practical Framework for General Image Editing

Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng · 2025 · cs.CV · arXiv 2504.17761

Baseline reference. 65% of citing Pith papers use this work as a benchmark or comparison.

67 Pith papers citing it

Baseline 65% of classified citations

open full Pith review browse 67 citing papers arXiv PDF

abstract

In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between the open-source algorithm with these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which can provide comparable performance against the closed-source models like GPT-4o and Gemini2 Flash. More specifically, we adopt the Multimodal LLM to process the reference image and the user's editing instruction. A latent embedding has been extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop the GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 12 dataset 12 baseline 9 method 1

citation-polarity summary

background 12 use dataset 12 baseline 10

claims ledger

abstract In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between the open-source algorithm with these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing mode

co-cited works

representative citing papers

Masked Generative Transformer Is What You Need for Image Editing

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EditMGT applies masked generative transformers with attention consolidation and region-hold sampling to deliver state-of-the-art localized image editing at 6x the speed of diffusion methods.

VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

VINS-120K supplies the first large-scale set of instruction-image-edited-image triplets at ultra-high resolution together with an adaptation strategy that improves detail synthesis.

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

cs.CV · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.

From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.

Inline Critic Steers Image Editing

cs.CV · 2026-05-12 · conditional · novelty 7.0

Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.

Generate "Normal", Edit Poisoned: Branding Injection via Hint Embedding in Image Editing

cs.CR · 2026-05-11 · unverdicted · novelty 7.0

Invisible hints such as logos embedded in images are re-rendered by diffusion models during text-guided editing, enabling phishing and model-poisoning attacks with average success rates of 44.4% and 32.2%.

EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.

FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching

cs.CV · 2026-05-06 · unverdicted · novelty 7.0 · 2 refs

FlowDIS uses flow matching to transport image distributions to mask distributions, optionally conditioned on text, and outperforms prior DIS methods by 5.5% on F_beta^omega and 43% on MAE.

MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing

cs.CV · 2026-05-04 · unverdicted · novelty 7.0

MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and script fidelity.

SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking

cs.CV · 2026-05-04 · unverdicted · novelty 7.0

SpecEdit accelerates diffusion-based image editing up to 10x by using a low-resolution draft to identify edit-relevant tokens via semantic discrepancies for selective high-resolution denoising.

Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

A co-trained adapter framework enables mask-free local editing in DiTs by factorizing edit semantics from spatial location and jointly learning a mask predictor.

HP-Edit: A Human-Preference Post-Training Framework for Image Editing

cs.CV · 2026-04-21 · unverdicted · novelty 7.0

HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.

UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

cs.CV · 2026-04-19 · unverdicted · novelty 7.0 · 2 refs

UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

cs.CV · 2026-04-17 · unverdicted · novelty 7.0

UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.

AIM-Bench: Benchmarking and Improving Affective Image Manipulation via Fine-Grained Hierarchical Control

cs.CV · 2026-04-12 · unverdicted · novelty 7.0

AIM-Bench is the first dedicated benchmark for editing images to evoke specific emotions with fine-grained control, paired with AIM-40k dataset that delivers a 9.15% performance gain by correcting training data imbalances.

RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.

Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off

cs.CV · 2026-03-23 · unverdicted · novelty 7.0

Dress-ED is the first large-scale benchmark unifying virtual try-on, try-off, and text-guided garment editing with 146k verified samples plus a multimodal diffusion baseline.

DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

cs.CV · 2026-02-27 · unverdicted · novelty 7.0

DLEBench is the first benchmark for small-scale object editing in instruction-based image editing models, using 1889 samples, seven instruction types, and a dual-mode evaluation protocol to reveal performance gaps in 10 tested models.

A Unified and Controllable Framework for Layered Image Generation with Visual Effects

cs.CV · 2026-01-21 · unverdicted · novelty 7.0

LASAGNA produces layered images with integrated visual effects in a single pass, enabling drift-free edits via alpha compositing while releasing a 48K dataset and a 242-sample benchmark.

InstructMoLE: Instruction-Guided Mixture of Low-rank Experts for Multi-Conditional Image Generation

cs.CV · 2025-12-25 · unverdicted · novelty 7.0

InstructMoLE replaces per-token routing with instruction-guided global routing for mixture-of-low-rank-experts in diffusion transformers and adds an output-space orthogonality loss to improve multi-conditional image generation.

Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

cs.CV · 2025-12-11 · unverdicted · novelty 7.0

Omni-Attribute is a new open-vocabulary image attribute encoder trained on semantically linked pairs with dual objectives to produce disentangled representations for personalization and compositional generation.

In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

cs.CV · 2025-04-29 · unverdicted · novelty 7.0

ICEdit achieves state-of-the-art instructional image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.

AesFormer: Transform Everyday Photos into Beautiful Memories

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

AesFormer decouples aesthetic planning from image editing via AesThinker and AesEditor to enable structural reconstruction in photos for better aesthetics.

citing papers explorer

Showing 50 of 67 citing papers.

Masked Generative Transformer Is What You Need for Image Editing cs.CV · 2026-05-11 · unverdicted · none · ref 10 · internal anchor
EditMGT applies masked generative transformers with attention consolidation and region-hold sampling to deliver state-of-the-art localized image editing at 6x the speed of diffusion methods.
VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset cs.CV · 2026-05-22 · unverdicted · none · ref 29 · internal anchor
VINS-120K supplies the first large-scale set of instruction-image-edited-image triplets at ultra-high resolution together with an adaptation strategy that improves detail synthesis.
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning cs.CV · 2026-05-20 · unverdicted · none · ref 25 · 2 links · internal anchor
Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing cs.CV · 2026-05-14 · unverdicted · none · ref 24 · internal anchor
A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling cs.CV · 2026-05-13 · unverdicted · none · ref 23 · internal anchor
Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
Inline Critic Steers Image Editing cs.CV · 2026-05-12 · conditional · none · ref 30 · internal anchor
Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.
Generate "Normal", Edit Poisoned: Branding Injection via Hint Embedding in Image Editing cs.CR · 2026-05-11 · unverdicted · none · ref 7 · internal anchor
Invisible hints such as logos embedded in images are re-rendered by diffusion models during text-guided editing, enabling phishing and model-poisoning attacks with average success rates of 44.4% and 32.2%.
EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement cs.CV · 2026-05-08 · unverdicted · none · ref 30 · internal anchor
EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.
FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching cs.CV · 2026-05-06 · unverdicted · none · ref 21 · 2 links · internal anchor
FlowDIS uses flow matching to transport image distributions to mask distributions, optionally conditioned on text, and outperforms prior DIS methods by 5.5% on F_beta^omega and 43% on MAE.
MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing cs.CV · 2026-05-04 · unverdicted · none · ref 34 · internal anchor
MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and script fidelity.
SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking cs.CV · 2026-05-04 · unverdicted · none · ref 3 · internal anchor
SpecEdit accelerates diffusion-based image editing up to 10x by using a low-resolution draft to identify edit-relevant tokens via semantic discrepancies for selective high-resolution denoising.
Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing cs.CV · 2026-04-26 · unverdicted · none · ref 31 · internal anchor
A co-trained adapter framework enables mask-free local editing in DiTs by factorizing edit semantics from spatial location and jointly learning a mask predictor.
HP-Edit: A Human-Preference Post-Training Framework for Image Editing cs.CV · 2026-04-21 · unverdicted · none · ref 26 · internal anchor
HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models cs.CV · 2026-04-19 · unverdicted · none · ref 50 · 2 links · internal anchor
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs cs.CV · 2026-04-17 · unverdicted · none · ref 33 · internal anchor
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
AIM-Bench: Benchmarking and Improving Affective Image Manipulation via Fine-Grained Hierarchical Control cs.CV · 2026-04-12 · unverdicted · none · ref 6 · internal anchor
AIM-Bench is the first dedicated benchmark for editing images to evoke specific emotions with fine-grained control, paired with AIM-40k dataset that delivers a 9.15% performance gain by correcting training data imbalances.
RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details cs.CV · 2026-04-08 · unverdicted · none · ref 26 · internal anchor
RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.
Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off cs.CV · 2026-03-23 · unverdicted · none · ref 35 · internal anchor
Dress-ED is the first large-scale benchmark unifying virtual try-on, try-off, and text-guided garment editing with 146k verified samples plus a multimodal diffusion baseline.
DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model cs.CV · 2026-02-27 · unverdicted · none · ref 15 · internal anchor
DLEBench is the first benchmark for small-scale object editing in instruction-based image editing models, using 1889 samples, seven instruction types, and a dual-mode evaluation protocol to reveal performance gaps in 10 tested models.
A Unified and Controllable Framework for Layered Image Generation with Visual Effects cs.CV · 2026-01-21 · unverdicted · none · ref 29 · internal anchor
LASAGNA produces layered images with integrated visual effects in a single pass, enabling drift-free edits via alpha compositing while releasing a 48K dataset and a 242-sample benchmark.
InstructMoLE: Instruction-Guided Mixture of Low-rank Experts for Multi-Conditional Image Generation cs.CV · 2025-12-25 · unverdicted · none · ref 20 · internal anchor
InstructMoLE replaces per-token routing with instruction-guided global routing for mixture-of-low-rank-experts in diffusion transformers and adds an output-space orthogonality loss to improve multi-conditional image generation.
Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization cs.CV · 2025-12-11 · unverdicted · none · ref 39 · internal anchor
Omni-Attribute is a new open-vocabulary image attribute encoder trained on semantically linked pairs with dual objectives to produce disentangled representations for personalization and compositional generation.
In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer cs.CV · 2025-04-29 · unverdicted · none · ref 8 · internal anchor
ICEdit achieves state-of-the-art instructional image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.
AesFormer: Transform Everyday Photos into Beautiful Memories cs.CV · 2026-05-21 · unverdicted · none · ref 13 · internal anchor
AesFormer decouples aesthetic planning from image editing via AesThinker and AesEditor to enable structural reconstruction in photos for better aesthetics.
TextSculptor: Training and Benchmarking Scene Text Editing cs.CV · 2026-05-20 · unverdicted · none · ref 17 · internal anchor
TextSculptor supplies an automated data synthesis pipeline yielding 3.2M samples plus a four-task benchmark that raises open-source scene text editing performance.
Semantic Generative Tuning for Unified Multimodal Models cs.CV · 2026-05-18 · unverdicted · none · ref 34 · internal anchor
Semantic Generative Tuning uses image segmentation as a generative proxy to align misaligned representation spaces in unified multimodal models and improve both perception and generative layout fidelity.
Lance: Unified Multimodal Modeling by Multi-Task Synergy cs.CV · 2026-05-18 · unverdicted · none · ref 76 · 2 links · internal anchor
Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.
Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation cs.CV · 2026-05-13 · unverdicted · none · ref 19 · internal anchor
Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.
G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models cs.CV · 2026-05-12 · conditional · none · ref 16 · 2 links · internal anchor
G²TR reduces visual tokens and prefill compute by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency, balanced selection, and merging, while preserving reasoning accuracy and editing quality.
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation cs.CV · 2026-05-12 · unverdicted · none · ref 22 · 2 links · internal anchor
A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.
GeoR-Bench: Evaluating Geoscience Visual Reasoning cs.CV · 2026-05-12 · unverdicted · none · ref 10 · internal anchor
GeoR-Bench shows top multimodal models reach only 42.7% strict accuracy on geoscience visual reasoning tasks while open-source models reach 10.3%, with outputs often visually plausible yet scientifically inaccurate.
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer cs.CV · 2026-05-11 · unverdicted · none · ref 31 · internal anchor
A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B to over 200B parameters.
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria cs.AI · 2026-05-08 · unverdicted · none · ref 24 · internal anchor
Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text-to-image and editing benchmarks.
MooD: Perception-Enhanced Efficient Affective Image Editing via Continuous Valence-Arousal Modeling cs.CV · 2026-05-04 · unverdicted · none · ref 48 · internal anchor
MooD introduces continuous valence-arousal modeling with VA-aware retrieval and perception-enhanced guidance for efficient, controllable affective image editing, plus a new AffectSet dataset.
Leveraging Verifier-Based Reinforcement Learning in Image Editing cs.CV · 2026-04-30 · unverdicted · none · ref 35 · 2 links · internal anchor
Edit-R1 builds a CoT-based reasoning reward model (RRM) via SFT and GCPO, then applies it with GRPO to improve image editing models such as FLUX.1-kontext.
SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness cs.CV · 2026-04-29 · unverdicted · none · ref 28 · internal anchor
SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.
DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing cs.CV · 2026-04-28 · unverdicted · none · ref 11 · internal anchor
DDA-Thinker decouples planning from generation and applies dual-atomic RL with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.
Meta-CoT: Enhancing Granularity and Generalization in Image Editing cs.CV · 2026-04-27 · unverdicted · none · ref 38 · internal anchor
Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.
MeshLAM: Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction cs.CV · 2026-04-23 · unverdicted · none · ref 42 · internal anchor
MeshLAM reconstructs high-fidelity animatable textured mesh head avatars from a single image via a feed-forward dual shape-texture architecture with iterative GRU decoding and reprojection-based guidance.
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model cs.CV · 2026-04-22 · unverdicted · none · ref 23 · internal anchor
LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing cs.CV · 2026-04-22 · unverdicted · none · ref 21 · internal anchor
Task-aware localization via attention cues and feature centroids from source/target streams in IIE models improves non-edit consistency while preserving instruction following.
LIVE: Leveraging Image Manipulation Priors for Instruction-based Video Editing cs.CV · 2026-04-18 · unverdicted · none · ref 24 · internal anchor
LIVE achieves state-of-the-art instruction-based video editing by jointly training on image and video data with a frame-wise token noise strategy to bridge domain gaps and a new benchmark of over 60 tasks.
TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training cs.AI · 2026-04-12 · unverdicted · none · ref 15 · 2 links · internal anchor
TorchUMM is the first unified codebase and benchmark suite for multimodal understanding, generation, and editing across varied UMM models and datasets.
InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation cs.CV · 2026-04-09 · unverdicted · none · ref 24 · internal anchor
InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.
SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing cs.CV · 2026-04-06 · unverdicted · none · ref 32 · internal anchor
SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.
Emu3.5: Native Multimodal Models are World Learners cs.CV · 2025-10-30 · unverdicted · none · ref 58 · internal anchor
Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.
EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning cs.CV · 2025-09-24 · unverdicted · none · ref 18 · internal anchor
EditVerse unifies image and video editing and generation in one transformer model via unified token sequences and in-context learning, trained jointly on curated video editing data plus image/video corpora and evaluated on a new instruction-based benchmark.
FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space cs.GR · 2025-06-17 · unverdicted · none · ref 31 · internal anchor
FLUX.1 Kontext unifies image generation and editing via flow matching and sequence concatenation, delivering improved multi-turn consistency and speed on the new KontextBench benchmark.
ImgEdit: A Unified Image Editing Dataset and Benchmark cs.CV · 2025-05-26 · conditional · none · ref 46 · internal anchor
ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.
Edit-GRPO: A Locality-Preserving Policy Optimization Framework for Image Editing cs.CV · 2026-05-16 · unverdicted · none · ref 19 · internal anchor
Edit-GRPO decouples editing and preservation objectives via region-specific signals in a policy optimization framework to improve locality in image editing tasks.

Step1X-Edit: A Practical Framework for General Image Editing

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer