ImgEdit: A Unified Image Editing Dataset and Benchmark
Pith reviewed 2026-05-12 18:13 UTC · model grok-4.3
The pith
ImgEdit supplies 1.2 million curated edit pairs that let a vision-language-model editor outperform prior open-source systems on instruction-based image changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ImgEdit is a large-scale image-editing dataset of 1.2 million carefully curated pairs that contain both novel, complex single-turn edits and challenging multi-turn tasks; a multi-stage pipeline using a vision-language model, detection, segmentation, inpainting, and strict post-processing ensures quality and diversity; models trained on ImgEdit, specifically the VLM-based ImgEdit-E1, outperform existing open-source editors on multiple tasks; and ImgEdit-Bench evaluates open-source models, proprietary models, and the new model on instruction adherence, editing quality, and detail preservation.
What carries the argument
The multi-stage curation pipeline that integrates a vision-language model, detection model, segmentation model, task-specific inpainting, and post-processing to generate high-quality edit pairs from reference images and prompts.
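To make the stages concrete, the sketch below walks one reference image through a hypothetical version of such a curation loop. It is an illustration under assumptions: every object and method name (caption_and_propose_edit, detect, segment, apply, passes) is a placeholder, not the authors' code or any specific library's API.

```python
# Minimal sketch of a multi-stage edit-pair curation loop (hypothetical interfaces).
from dataclasses import dataclass

@dataclass
class EditPair:
    source_path: str
    edited_path: str
    instruction: str
    edit_type: str

def curate_pair(image_path, vlm, detector, segmenter, inpainter, filters):
    # 1. A vision-language model proposes an edit instruction for the image.
    instruction, target_object, edit_type = vlm.caption_and_propose_edit(image_path)
    # 2. An open-vocabulary detector localizes the object named in the instruction.
    boxes = detector.detect(image_path, target_object)
    if not boxes:
        return None  # discard images where the target cannot be localized
    # 3. A segmentation model refines the box into a pixel mask.
    mask = segmenter.segment(image_path, boxes[0])
    # 4. A task-specific inpainting model applies the edit inside the mask.
    edited_path = inpainter.apply(image_path, mask, instruction, edit_type)
    # 5. Strict post-processing: keep the pair only if it clears quality filters
    #    (e.g., aesthetic score, instruction-image agreement, artifact checks).
    if not filters.passes(source=image_path, edited=edited_path, instruction=instruction):
        return None
    return EditPair(image_path, edited_path, instruction, edit_type)
```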
If this is right
- Open-source image editing can advance on both single-turn and multi-turn instructions once high-quality paired data is available.
- Standardized benchmarks like ImgEdit-Bench expose concrete gaps in current models' ability to preserve details while following edits.
- The same curation approach could scale to produce even larger training sets without manual labeling.
- Proprietary models may lose their edge if open models continue to train on comparably clean and diverse pairs.
Where Pith is reading between the lines
- The pipeline's reliance on existing detection and segmentation tools suggests that further gains in those sub-models would automatically improve future editing datasets.
- Multi-turn evaluation suites could become the default test for interactive creative tools, shifting research focus from one-shot generation to iterative refinement.
- If the dataset is adopted widely, community fine-tunes of ImgEdit-E1 may appear that specialize in domains such as product photography or medical imagery.
Load-bearing premise
The pairs produced by the automated multi-stage pipeline are sufficiently free of curation artifacts and diverse enough that models trained on them generalize to real editing requests rather than learning pipeline-specific patterns.
What would settle it
Run ImgEdit-E1 and competing open-source models on a fresh set of user-provided prompts and images never seen during curation, then measure whether human raters still judge ImgEdit-E1 edits as superior in instruction match and visual quality.
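A hedged sketch of how that comparison could be scored: collect pairwise human preferences between ImgEdit-E1 and a baseline on the fresh prompts, then report a win rate with a bootstrap confidence interval. The label format and toy numbers below are assumptions for illustration, not anything specified by the paper.

```python
import random

def win_rate(prefs):
    # prefs: list of "E1", "baseline", or "tie" labels from human raters
    wins = sum(p == "E1" for p in prefs)
    ties = sum(p == "tie" for p in prefs)
    return (wins + 0.5 * ties) / len(prefs)

def bootstrap_ci(prefs, n_boot=10_000, alpha=0.05, seed=0):
    # Percentile bootstrap over resampled rating sets.
    rng = random.Random(seed)
    stats = sorted(
        win_rate([rng.choice(prefs) for _ in prefs]) for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy labels: a win rate whose interval excludes 0.5 would indicate raters
# still prefer ImgEdit-E1 on prompts never seen during curation.
labels = ["E1"] * 60 + ["baseline"] * 30 + ["tie"] * 10
print(win_rate(labels), bootstrap_ci(labels))
```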
Original abstract
Recent advancements in generative models have enabled high-fidelity text-to-image generation. However, open-source image-editing models still lag behind their proprietary counterparts, primarily due to limited high-quality data and insufficient benchmarks. To overcome these limitations, we introduce ImgEdit, a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs, which contain both novel and complex single-turn edits, as well as challenging multi-turn tasks. To ensure the data quality, we employ a multi-stage pipeline that integrates a cutting-edge vision-language model, a detection model, a segmentation model, alongside task-specific in-painting procedures and strict post-processing. ImgEdit surpasses existing datasets in both task novelty and data quality. Using ImgEdit, we train ImgEdit-E1, an editing model using Vision Language Model to process the reference image and editing prompt, which outperforms existing open-source models on multiple tasks, highlighting the value of ImgEdit and model design. For comprehensive evaluation, we introduce ImgEdit-Bench, a benchmark designed to evaluate image editing performance in terms of instruction adherence, editing quality, and detail preservation. It includes a basic testsuite, a challenging single-turn suite, and a dedicated multi-turn suite. We evaluate both open-source and proprietary models, as well as ImgEdit-E1, providing deep analysis and actionable insights into the current behavior of image-editing models. The source data are publicly available on https://github.com/PKU-YuanGroup/ImgEdit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ImgEdit, a dataset of 1.2 million image-editing pairs curated via a multi-stage pipeline (VLM prompt generation, detection, segmentation, inpainting, and post-processing) that claims higher novelty and quality than prior datasets. It trains ImgEdit-E1, a VLM-based editing model, and reports that this model outperforms existing open-source models on the authors' new ImgEdit-Bench, which comprises basic, challenging single-turn, and multi-turn suites measuring instruction adherence, editing quality, and detail preservation.
Significance. If the pipeline produces genuinely high-quality, diverse, and generalizable edit pairs without systematic curation artifacts, the work would be significant: it supplies a large public resource that could narrow the gap between open-source and proprietary editing models, while the benchmark offers a standardized evaluation framework. The public release of the data and code is a clear strength that supports reproducibility.
major comments (3)
- [Section 3] Section 3 (dataset construction): the multi-stage pipeline is presented without any quantitative validation of output quality (human ratings, inter-annotator agreement, error analysis, or ablation on individual stages such as VLM prompt generation or inpainting). This directly underpins the central claim that ImgEdit surpasses existing datasets in quality and novelty.
- [Section 4] Section 4 (model training and results): the reported outperformance of ImgEdit-E1 is given without details on how benchmark scores were computed, without ablations isolating the contribution of the new dataset versus model architecture, and without checks for pipeline-induced biases that could make superiority non-generalizable.
- [Section 5] Section 5 (ImgEdit-Bench): the benchmark description lacks explicit definitions or formulas for the three evaluation axes (instruction adherence, editing quality, detail preservation) and provides no analysis of metric reliability or potential annotation artifacts in the test suites.
minor comments (2)
- [Abstract / Section 3] The abstract and Section 3 refer to 'strict post-processing' without enumerating the exact filtering criteria or thresholds, which would aid reproducibility (an illustrative sketch of what such an enumeration could look like follows this list).
- [Tables/Figures] Table or figure captions comparing ImgEdit to prior datasets could more explicitly list the exact metrics used for the 'novelty' and 'quality' claims.
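Purely as an illustration of the enumeration requested above, the sketch below lists hypothetical filter names and thresholds in Python; none of these criteria or values come from the paper.

```python
# Hypothetical post-processing specification (illustrative only).
POST_PROCESSING_FILTERS = {
    "min_aesthetic_score": 5.0,               # drop low-quality source or edited images
    "min_instruction_image_agreement": 0.25,  # judged match between edit and prompt
    "max_background_change": 0.05,            # mean pixel change outside the edit mask
    "min_mask_area_fraction": 0.01,           # discard edits on vanishingly small regions
    "max_mask_area_fraction": 0.60,           # discard edits that rewrite the whole image
}

def passes_filters(stats, filters=POST_PROCESSING_FILTERS):
    # stats: measured values for one candidate edit pair, keyed without min_/max_ prefixes
    return (
        stats["aesthetic_score"] >= filters["min_aesthetic_score"]
        and stats["instruction_image_agreement"] >= filters["min_instruction_image_agreement"]
        and stats["background_change"] <= filters["max_background_change"]
        and filters["min_mask_area_fraction"]
            <= stats["mask_area_fraction"]
            <= filters["max_mask_area_fraction"]
    )
```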
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive comments on our paper. We address each of the major comments point by point below, providing clarifications and committing to revisions where necessary to strengthen the manuscript.
Point-by-point responses
-
Referee: [Section 3] Section 3 (dataset construction): the multi-stage pipeline is presented without any quantitative validation of output quality (human ratings, inter-annotator agreement, error analysis, or ablation on individual stages such as VLM prompt generation or inpainting). This directly underpins the central claim that ImgEdit surpasses existing datasets in quality and novelty.
Authors: We appreciate the referee pointing out the need for quantitative validation to support our claims about the dataset's quality and novelty. Although the pipeline is designed with multiple quality-control stages, we agree that empirical validation is essential. In the revised manuscript, we will add human evaluation results on a subset of the data, inter-annotator agreement metrics, detailed error analysis, and ablations studying the impact of individual components like the VLM prompt generation and inpainting steps. This will provide concrete evidence that ImgEdit offers higher quality and novelty compared to existing datasets. revision: yes
-
Referee: [Section 4] Section 4 (model training and results): the reported outperformance of ImgEdit-E1 is given without details on how benchmark scores were computed, without ablations isolating the contribution of the new dataset versus model architecture, and without checks for pipeline-induced biases that could make superiority non-generalizable.
Authors: We acknowledge that additional details and analyses are required to fully substantiate the outperformance claims. In the revision, we will provide explicit details on the computation of the benchmark scores, include ablation studies that isolate the contributions of the ImgEdit dataset versus the VLM-based architecture, and conduct an analysis of potential biases arising from the curation pipeline, along with discussions on how these might affect the generalizability of the results. revision: yes
-
Referee: [Section 5] Section 5 (ImgEdit-Bench): the benchmark description lacks explicit definitions or formulas for the three evaluation axes (instruction adherence, editing quality, detail preservation) and provides no analysis of metric reliability or potential annotation artifacts in the test suites.
Authors: We agree that clear definitions, formulas, and reliability analysis are important for the benchmark's utility. We will revise Section 5 to include explicit definitions and mathematical formulas for the three evaluation axes: instruction adherence, editing quality, and detail preservation. Furthermore, we will add an analysis of the metrics' reliability and discuss potential annotation artifacts or biases in the basic, challenging single-turn, and multi-turn test suites. revision: yes
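For readers wondering what explicit definitions might look like, here is one NumPy sketch of how the three axes could be operationalized. These formulas are illustrative assumptions, not ImgEdit-Bench's actual scoring (which, per the review, is not yet specified); the embedding inputs are assumed to come from some joint text-image encoder.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def instruction_adherence(instr_emb, edited_img_emb):
    # Similarity between the editing instruction and the edited image.
    return cosine(instr_emb, edited_img_emb)

def editing_quality(target_desc_emb, edited_region_emb):
    # Similarity between a description of the requested change and the edited region alone.
    return cosine(target_desc_emb, edited_region_emb)

def detail_preservation(src, edited, edit_mask):
    # src, edited: HxWxC float arrays in [0, 1]; edit_mask: HxW bool array (True = edited).
    keep = ~edit_mask
    if keep.sum() == 0:
        return 1.0
    mse = float(((src[keep] - edited[keep]) ** 2).mean())
    return 1.0 / (1.0 + mse)  # 1.0 means the untouched region is unchanged

# Whether ImgEdit-Bench averages these axes, weights them, or uses different
# definitions entirely is exactly what the requested revision should specify.
```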
Circularity Check
No circularity: empirical dataset and benchmark construction
Full rationale
The paper presents an empirical contribution: curation of 1.2M edit pairs via a multi-stage pipeline (VLM + detection + segmentation + inpainting + post-processing), training of ImgEdit-E1 on those pairs, and introduction of ImgEdit-Bench for evaluation. No equations, fitted parameters, or derivations are claimed. The central claims (dataset quality, model outperformance) are externally falsifiable via human ratings, ablations, or comparisons on held-out data and do not reduce to self-definition or self-citation chains. Self-citations, if present, are not load-bearing for any mathematical result.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Off-the-shelf vision-language, detection, and segmentation models produce sufficiently accurate outputs for curation without introducing systematic biases that degrade downstream editing performance.
Forward citations
Cited by 32 Pith papers
-
Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
-
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.
-
RewardHarness: Self-Evolving Agentic Post-Training
RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
-
SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking
SpecEdit accelerates diffusion-based image editing up to 10x by using a low-resolution draft to identify edit-relevant tokens via semantic discrepancies for selective high-resolution denoising.
-
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
-
HiddenObjects: Scalable Diffusion-Distilled Spatial Priors for Object Placement
A diffusion-based pipeline creates a 27M-annotation dataset of object placements that outperforms human annotations and baselines on image editing tasks, then distills it into a fast model.
-
AIM-Bench: Benchmarking and Improving Affective Image Manipulation via Fine-Grained Hierarchical Control
AIM-Bench is the first dedicated benchmark for editing images to evoke specific emotions with fine-grained control, paired with AIM-40k dataset that delivers a 9.15% performance gain by correcting training data imbalances.
-
RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details
RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.
-
CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator
CAMEO uses coordinated agents for planning, prompting, generation, and quality feedback to achieve higher structural reliability in conditional image editing than single-step models.
-
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.
-
GeoR-Bench: Evaluating Geoscience Visual Reasoning
GeoR-Bench shows top multimodal models reach only 42.7% strict accuracy on geoscience visual reasoning tasks while open-source models reach 10.3%, with outputs often visually plausible yet scientifically inaccurate.
-
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...
-
DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning
DiffCap-Bench supplies a diverse IDC benchmark with ten categories and LLM judging grounded in human difference lists to evaluate MLLMs more robustly than prior lexical metrics.
-
PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning
PhysEdit introduces adaptive reasoning depth and spatial masking to make image editing faster and more instruction-aligned without retraining the base model.
-
SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness
SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.
-
Meta-CoT: Enhancing Granularity and Generalization in Image Editing
Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.
-
Image Generators are Generalist Vision Learners
Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.
-
Image Generators are Generalist Vision Learners
Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.
-
Making Image Editing Easier via Adaptive Task Reformulation with Agentic Executions
An MLLM agent reformulates image editing tasks into executable operation sequences to improve reliability on challenging cases across existing generative backbones.
-
VibeFlow: Versatile Video Chroma-Lux Editing through Self-Supervised Learning
VibeFlow performs versatile video chroma-lux editing in zero-shot fashion by self-supervised disentanglement of structure and color-illumination cues inside pre-trained video models, plus residual velocity fields and ...
-
InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation
InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.
-
SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing
SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.
-
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
-
DataEvolver: Let Your Data Build and Improve Itself via Goal-Driven Loop Agents
DataEvolver introduces a reusable framework with generation-time self-correction and validation-time self-expansion loops that improves visual datasets, shown to outperform baselines on an object-rotation task.
-
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
-
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
-
FineEdit: Fine-Grained Image Edit with Bounding Box Guidance
FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models ...
-
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...
-
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
-
Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.
-
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
-
TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training
TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.