IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye; Jun Zhang; Sibo Liu; Wei Yang; Xiao Han

arxiv: 2308.06721 · v1 · submitted 2023-08-13 · 💻 cs.CV · cs.AI

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye , Jun Zhang , Sibo Liu , Xiao Han , Wei Yang This is my paper

Pith reviewed 2026-05-10 23:51 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords image prompttext-to-image diffusionadaptercross-attentionmultimodal generationlightweight modelfrozen modelcompatible generation

0 comments

The pith

A 22-million-parameter adapter adds image prompting to frozen text-to-image diffusion models while preserving text compatibility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to equip existing text-to-image diffusion models with image prompt support using a small adapter module instead of full retraining. The central design separates cross-attention so that text features and image features are processed in distinct layers. This keeps the original model frozen and avoids interference between the two prompt types. As a result the adapter reaches or exceeds the output quality of much larger fine-tuned systems. Readers care because the approach makes image-guided generation practical on standard hardware and compatible with existing text workflows and control tools.

Core claim

IP-Adapter achieves image prompt capability for pretrained text-to-image diffusion models through a decoupled cross-attention mechanism that separates cross-attention layers for text features and image features. With only 22M parameters and the base model frozen, the adapter matches or exceeds the performance of fully fine-tuned image prompt models. The separation enables the image prompt to work together with text prompts, to generalize across custom models derived from the same base, and to combine with existing structural control tools.

What carries the argument

Decoupled cross-attention mechanism that assigns separate cross-attention layers to text features and to image features.

If this is right

The image prompt combines directly with the text prompt to produce multimodal outputs.
The same adapter applies without modification to other models fine-tuned from the identical base model.
Existing structural control tools remain usable alongside the image prompt.
Image prompting becomes available without retraining the base model or using large compute resources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation idea could be tested for adding other non-text signals such as depth maps or sketches.
Lightweight adapters of this type might reduce the overall need for repeated full-model fine-tuning across the field.
Users could experiment with mixing multiple reference images more freely once text and image pathways stay independent.

Load-bearing premise

Separating cross-attention layers for text and image features lets the adapter add image guidance without lowering the quality of the frozen model's original text-to-image generation.

What would settle it

A head-to-head test on standard image-prompt benchmarks where the 22M adapter produces measurably lower fidelity or weaker prompt adherence than a fully fine-tuned image-prompt baseline.

read the original abstract

Recent years have witnessed the strong power of large text-to-image diffusion models for the impressive generative capability to create high-fidelity images. However, it is very tricky to generate desired images using only text prompt as it often involves complex prompt engineering. An alternative to text prompt is image prompt, as the saying goes: "an image is worth a thousand words". Although existing methods of direct fine-tuning from pretrained models are effective, they require large computing resources and are not compatible with other base models, text prompt, and structural controls. In this paper, we present IP-Adapter, an effective and lightweight adapter to achieve image prompt capability for the pretrained text-to-image diffusion models. The key design of our IP-Adapter is decoupled cross-attention mechanism that separates cross-attention layers for text features and image features. Despite the simplicity of our method, an IP-Adapter with only 22M parameters can achieve comparable or even better performance to a fully fine-tuned image prompt model. As we freeze the pretrained diffusion model, the proposed IP-Adapter can be generalized not only to other custom models fine-tuned from the same base model, but also to controllable generation using existing controllable tools. With the benefit of the decoupled cross-attention strategy, the image prompt can also work well with the text prompt to achieve multimodal image generation. The project page is available at \url{https://ip-adapter.github.io}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces IP-Adapter, a lightweight (22M-parameter) adapter module for pretrained text-to-image diffusion models. It employs a decoupled cross-attention mechanism that routes image features through separate layers while keeping the base UNet frozen, enabling image prompting that is claimed to match or exceed fully fine-tuned image-prompt models. The design is presented as text-compatible, generalizable to other fine-tuned base models and controllable tools, and supportive of multimodal (text+image) generation.

Significance. If the empirical claims hold, the work is significant for demonstrating an efficient, plug-and-play adapter that avoids full-model fine-tuning while preserving compatibility with text conditioning and existing control mechanisms. The parameter efficiency and frozen-base strategy represent a practical contribution that could reduce compute barriers in customizing diffusion models. The project page link aids reproducibility and community adoption.

major comments (2)

[Experiments] The central claim of text compatibility and non-interference with the frozen pretrained model rests on the decoupled cross-attention design. The manuscript does not report quantitative before/after comparisons (e.g., FID or CLIP score on standard text-only benchmarks such as COCO with image input disabled) after inserting and training the adapter layers, leaving the no-degradation assumption unverified in the experimental evaluation.
[§4] §4 (results): The statement that the 22M-parameter IP-Adapter achieves 'comparable or even better performance' than fully fine-tuned image-prompt models requires explicit specification of the exact baselines, training compute, datasets, and metrics (including variance across runs) used in the comparison tables to substantiate the claim.

minor comments (2)

The abstract would be strengthened by including one or two key quantitative results (e.g., specific FID or user-study scores) rather than qualitative claims alone.
[Method] Notation for the decoupled cross-attention (e.g., separate Q/K/V projections for image vs. text) should be formalized with an equation in the method section for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. The points raised highlight opportunities to strengthen the experimental validation of our text-compatibility claims and the performance comparisons. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Experiments] The central claim of text compatibility and non-interference with the frozen pretrained model rests on the decoupled cross-attention design. The manuscript does not report quantitative before/after comparisons (e.g., FID or CLIP score on standard text-only benchmarks such as COCO with image input disabled) after inserting and training the adapter layers, leaving the no-degradation assumption unverified in the experimental evaluation.

Authors: We agree that explicit quantitative verification would strengthen the central claim. Although the base UNet remains frozen and the cross-attention is decoupled by design, we will add before/after experiments in the revised manuscript: we will report FID and CLIP scores on COCO for the original pretrained model versus the model with the trained IP-Adapter (image input disabled) to directly confirm no degradation in text-only performance. revision: yes
Referee: [§4] §4 (results): The statement that the 22M-parameter IP-Adapter achieves 'comparable or even better performance' than fully fine-tuned image-prompt models requires explicit specification of the exact baselines, training compute, datasets, and metrics (including variance across runs) used in the comparison tables to substantiate the claim.

Authors: We will revise §4 to explicitly enumerate the baselines (specific fine-tuned models and their training details), datasets, training compute, and metrics for each entry in the comparison tables. For variance, we will clarify that results are from single runs with fixed random seeds (standard practice when multiple runs are not reported) or add multi-run statistics if additional compute is feasible; this will provide full context for the 'comparable or better' statement. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical architectural proposal without derivations

full rationale

The paper presents IP-Adapter as a lightweight empirical design using decoupled cross-attention layers inserted into a frozen pretrained UNet, with no equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central claim to its own inputs. The 22M-parameter adapter's performance claims rest on experimental comparisons rather than any self-referential mathematical construction or uniqueness theorem imported from prior author work. The approach is self-contained as a practical engineering choice for text-compatible image prompting, with no detectable reduction of outputs to inputs by definition or construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The contribution is an empirical adapter architecture with no mathematical derivations, axioms, or free parameters described in the abstract.

invented entities (1)

IP-Adapter no independent evidence
purpose: Lightweight module to enable image prompting via decoupled cross-attention
The adapter is the core proposed artifact; no independent evidence or external validation is provided in the abstract.

pith-pipeline@v0.9.0 · 5558 in / 1158 out tokens · 47223 ms · 2026-05-10T23:51:27.281343+00:00 · methodology

discussion (0)

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Support-Conditioned Flow Matching Is Kernel Smoothing
cs.LG 2026-05 accept novelty 8.0

Support-conditioned flow matching under the Gaussian OT path is exactly Nadaraya-Watson kernel smoothing with time-decreasing bandwidth, implemented by a single Gaussian attention head.
EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

EM-Vid introduces an entity-centric latent patch memory bank with sparse token conditioning and budgeted updates for training-free consistent multi-shot video generation.
PIU: Proximity-guided Identity Unlearning in ID-Conditioned Diffusion Models
cs.CV 2026-05 unverdicted novelty 7.0

PIU suppresses target identity generation in Arc2Face by replacing it with a proximity-selected anchor identity through localized fine-tuning of cross-attention layers while preserving output quality for other identities.
Tiny-Engram: Trigger-Indexed Concept Tables for Generative Vision
cs.CV 2026-05 unverdicted novelty 7.0

Tiny-Engram uses small n-gram-indexed memory tables to bind trigger phrases to target visual identities in diffusion models while preserving compositional control from the surrounding prompt.
Functionalization via Structure Completion and Motion Rectification
cs.CV 2026-05 unverdicted novelty 7.0

Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture wi...
DirectTryOn: One-Step Virtual Try-On via Straightened Conditional Transport
cs.CV 2026-05 unverdicted novelty 7.0

DirectTryOn achieves state-of-the-art one-step virtual try-on performance by applying pure conditional transport, garment preservation loss, and self-consistency loss to straighten trajectories in pretrained generativ...
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
cs.CV 2026-05 unverdicted novelty 7.0

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
cs.CV 2026-05 unverdicted novelty 7.0

MoCam uses structured denoising dynamics in diffusion models to temporally decouple geometric alignment from appearance refinement, enabling unified novel view synthesis that outperforms prior methods on imperfect poi...
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
cs.CV 2026-05 unverdicted novelty 7.0

MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
cs.CV 2026-05 unverdicted novelty 7.0

UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.
Follow the Mean: Reference-Guided Flow Matching
cs.LG 2026-05 unverdicted novelty 7.0

Flow matching admits controllable generation by shifting the conditional endpoint mean computed from a reference set, enabling training-free guidance on frozen pretrained models.
Follow the Mean: Reference-Guided Flow Matching
cs.LG 2026-05 unverdicted novelty 7.0

Flow matching admits reference-guided control by shifting the conditional endpoint mean, enabling training-free steering of models like FLUX via example banks and a semi-parametric variant on DiT.
Detecting Deception, Not Deepfakes: Why Media Forensics Needs Social Theories
cs.CY 2026-05 unverdicted novelty 7.0

Deepfake detection must shift from classifying media realism to detecting communicative deception by applying Speech Act Theory, Grice's Cooperative Principle, and Cialdini's influence principles.
Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision
cs.CV 2026-05 unverdicted novelty 7.0

Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.
Adaptive Subspace Projection for Generative Personalization
cs.CV 2026-05 unverdicted novelty 7.0

A training-free adaptive subspace projection method mitigates semantic collapsing in generative personalization by isolating and adjusting drift in a low-dimensional subspace using the stable pre-trained embedding as anchor.
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
cs.CV 2026-05 unverdicted novelty 7.0

D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by minimizing distribution differences between a text-only student and a multimodal teacher on the student's o...
A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping
cs.CV 2026-05 unverdicted novelty 7.0

Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and...
CA-IDD: Cross-Attention Guided Identity-Conditional Diffusion for Identity-Consistent Face Swapping
cs.CV 2026-04 unverdicted novelty 7.0

CA-IDD is the first diffusion model for face swapping that integrates multi-modal cross-attention guidance from identity embeddings, gaze, and facial parsing to achieve better identity consistency and an FID of 11.73 ...
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

MuSS is a movie-derived dataset and benchmark that enables AI models to generate multi-shot videos with coherent narratives and preserved subject identity across shots.
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition
cs.GR 2026-04 unverdicted novelty 7.0

StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.
AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe
cs.MM 2026-04 unverdicted novelty 7.0

AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both a...
View-Consistent 3D Scene Editing via Dual-Path Structural Correspondense and Semantic Continuity
cs.CV 2026-04 unverdicted novelty 7.0

A dual-path consistency framework for text-driven 3D scene editing that models cross-view dependencies via structural correspondence and semantic continuity, trained on a newly constructed paired multi-view dataset.
ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding
cs.CV 2026-04 unverdicted novelty 7.0

ASTRA disentangles subject identity from pose structure in diffusion transformers via retrieval-augmented pose guidance, asymmetric EURoPE embeddings, and a DSM adapter to improve multi-subject generation.
Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
cs.CV 2026-04 unverdicted novelty 7.0

A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.
UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation
cs.CV 2026-04 conditional novelty 7.0

UDAPose improves low-light human pose estimation by synthesizing realistic images via DHF and LCIM modules and dynamically balancing image cues with pose priors using DCA, yielding AP gains of 10.1 and 7.4 over prior methods.
NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity
cs.LG 2026-04 unverdicted novelty 7.0

NeuroFlow is the first unified flow model for bidirectional visual encoding and decoding from neural activity using NeuroVAE and cross-modal flow matching.
Large-Scale Universal Defect Generation: Foundation Models and Datasets
cs.CV 2026-04 unverdicted novelty 7.0

A 300K quadruplet dataset and UniDG foundation model enable reference- or text-driven defect generation across categories, outperforming few-shot baselines on anomaly detection tasks.
MoZoo:Unleashing Video Diffusion power in animal fur and muscle simulation
cs.GR 2026-04 unverdicted novelty 7.0

MoZoo generates high-fidelity animal videos with fur and muscle dynamics from coarse meshes by extending video diffusion with role-aware RoPE and asymmetric decoupled attention, trained on a new synthetic-to-real dataset.
Graph-PiT: Enhancing Structural Coherence in Part-Based Image Synthesis via Graph Priors
cs.CV 2026-04 unverdicted novelty 7.0

Graph-PiT adds graph priors and a hierarchical GNN to part-based image synthesis to enforce relational constraints and improve structural coherence over vanilla PiT.
DSH-Bench: A Difficulty- and Scenario-Aware Benchmark with Hierarchical Subject Taxonomy for Subject-Driven Text-to-Image Generation
cs.CV 2026-03 unverdicted novelty 7.0

DSH-Bench is a benchmark for subject-driven T2I generation that uses hierarchical taxonomy sampling, difficulty/scenario classification, and a new SICS metric showing 9.4% higher human correlation than prior measures.
VecSet-Edit: Unleashing Pre-trained LRM for Mesh Editing from Single Image
cs.CV 2026-02 unverdicted novelty 7.0

VecSet-Edit is the first method to perform high-fidelity mesh editing from a single image by analyzing and manipulating spatial token subsets in a pre-trained VecSet LRM.
Setting the Stage: Text-Driven Scene-Consistent Image Generation
cs.CV 2025-12 conditional novelty 7.0

A new data pipeline using real photos, entity removal, and image-to-video models plus a cross-view attention loss enables text-driven generation of actors in reference scenes with improved alignment.
Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
cs.CV 2025-12 unverdicted novelty 7.0

Omni-Attribute is a new open-vocabulary image attribute encoder trained on semantically linked pairs with dual objectives to produce disentangled representations for personalization and compositional generation.
RDSplat: Robust Watermarking for 3D Gaussian Splatting Against 2D and 3D Diffusion Editing
cs.CV 2025-12 conditional novelty 7.0

RDSplat is the first 3D Gaussian Splatting watermarking method that maintains 0.701 bit accuracy against both 2D and 3D diffusion editing by embedding only in low-frequency primitives selected via FAPS.
One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer
cs.CV 2025-11 unverdicted novelty 7.0

One-to-All Animation enables alignment-free character animation and image pose transfer via self-supervised outpainting reformulation, reference extraction, hybrid fusion attention, identity-robust pose control, and t...
ART-VITON: Measurement-Guided Latent Diffusion for Artifact-Free Virtual Try-On
cs.CV 2025-09 unverdicted novelty 7.0

ART-VITON uses residual prior initialization and artifact-free measurement-guided sampling with data consistency, frequency correction, and periodic denoising to generate artifact-free virtual try-on images on VITON-H...
Durian: Dual Reference Image-Guided Portrait Animation with Attribute Transfer
cs.CV 2025-09 conditional novelty 7.0

Durian introduces a dual-reference diffusion model trained via self-reconstruction on video frames to enable cross-identity attribute transfer in portrait animations, supporting multi-attribute composition and interpolation.
ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing
cs.CV 2025-07 unverdicted novelty 7.0

ChangeBridge introduces a drift-asynchronous diffusion bridge with composed initialization, pixel-wise drift maps, and drift-aware denoising to produce spatially and temporally coherent post-event remote sensing images.
PacTure: Efficient PBR Texture Generation on Packed Views with Visual Autoregressive Models
cs.CV 2025-05 unverdicted novelty 7.0

PacTure uses view packing and next-scale autoregressive prediction to generate consistent multi-view PBR textures faster than prior sequential or cross-attention methods.
UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models
cs.CV 2025-04 unverdicted novelty 7.0

UniEdit-Flow presents tuning-free Uni-Inv and Uni-Edit methods for inversion and editing in flow models that achieve accurate reconstruction and robust region-preserving edits across generative models.
Gungnir: Exploiting Stylistic Features in Images for Backdoor Attacks on Diffusion Models
cs.CV 2025-02 unverdicted novelty 7.0

Gungnir shows that style-based triggers with RAN and STTR techniques can activate backdoors in diffusion models while evading detection and surviving fine-tuning.
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
cs.CV 2024-03 unverdicted novelty 7.0

ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.
STAMBRIDGE: Spectral-Temporal Amplitude-aware Mid-Feature Bridge for EEG Visual Decoding
eess.IV 2026-05 unverdicted novelty 6.0

STAMBRIDGE proposes STAM for conditioned EEG representations via soft amplitude weighting and MFSB for staged semantic bridging, reporting 34.50% Top-1 accuracy in 200-way zero-shot retrieval on THINGS-EEG.
AttriStory: Fine-grained Attribute Realization for Visual Storytelling with Diffusion Models
cs.CV 2026-05 unverdicted novelty 6.0

AttriStory adds a benchmark and AttriLoss-based latent optimization to improve faithful rendering of fine-grained attributes such as clothing color and texture in diffusion-model visual storytelling.
VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers
cs.CV 2026-05 unverdicted novelty 6.0

VISTA introduces a new synthetic triplet dataset and diffusion-transformer framework with style adapter that jointly models style, content, and motion to achieve state-of-the-art video style transfer.
DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers
cs.CV 2026-05 unverdicted novelty 6.0

DiRotQ uses PCA-based rotation-aware activation quantization combined with GPTQ to achieve better FID and PSNR in 4-bit diffusion transformers than prior methods like SVDQuant.
AdaEraser: Training-Free Object Removal via Adaptive Attention Suppression
cs.CV 2026-05 unverdicted novelty 6.0

AdaEraser introduces token-wise adaptive attention suppression in diffusion denoising to enable high-quality training-free object removal by modulating suppression according to evolving self-attention maps.
MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer
cs.CV 2026-05 unverdicted novelty 6.0

MaTe proposes a training-free diffusion transformer that performs material transfer using only images by integrating them at the token level for unified multi-modal attention in a shared latent space.
CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis
cs.CV 2026-05 unverdicted novelty 6.0

CRAFT adapts diffusion models to medical images via clinical reward alignment from LLMs and VLMs, improving alignment scores and cutting low-quality generations by 20.4% on average across modalities.
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
cs.CV 2026-05 unverdicted novelty 6.0

V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
cs.CV 2026-05 unverdicted novelty 6.0

A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.
L2P: Unlocking Latent Potential for Pixel Generation
cs.CV 2026-05 unverdicted novelty 6.0

L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition
cs.CV 2026-05 unverdicted novelty 6.0

Fashion130K dataset and UMC framework align text and visual prompts with embedding refiner, Fusion Transformer, and redesigned attention to generate more consistent outfits than prior methods.
Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition
cs.CV 2026-05 unverdicted novelty 6.0

Fashion130K dataset and UMC framework align text and visual prompts to generate more consistent fashion outfits than prior state-of-the-art methods.
Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

HFRU is a two-stage reinforcement unlearning method operating on the vision encoder with GRPO optimization and an abstraction reward that achieves over 98% forgetting and retention on object and face tasks with neglig...
From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data
cs.CV 2026-05 unverdicted novelty 6.0

The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world perfor...
Decoupling Semantics and Fingerprints: A Universal Representation for AI-Generated Image Detection
cs.CV 2026-05 unverdicted novelty 6.0

ODP-Net structurally disentangles universal forgery traces from generator fingerprints and semantics via orthogonal decomposition and purification, delivering state-of-the-art generalization to unseen AI image generat...
Decoupling Semantics and Fingerprints: A Universal Representation for AI-Generated Image Detection
cs.CV 2026-05 unverdicted novelty 6.0

ODP-Net uses instance-aware orthogonal decomposition, perturbation-based purification, and manifold alignment to separate universal forgery traces, generator fingerprints, and semantics, achieving SOTA on unseen archi...
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
cs.CV 2026-05 unverdicted novelty 6.0

D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 123 Pith papers · 19 internal anchors

[1]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021

work page internal anchor Pith review arXiv 2021
[2]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[3]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems , 35:36479–36494, 2022

work page 2022
[4]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

work page 2022
[5]

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022

work page internal anchor Pith review arXiv 2022
[6]

Raphael: Text-to-image generation via large mixture of diffusion paths

Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, and Ping Luo. Raphael: Text-to- image generation via large mixture of diffusion paths. arXiv preprint arXiv:2305.18295, 2023

work page arXiv 2023
[7]

Investigating prompt engineering in diffusion models

Sam Witteveen and Martin Andrews. Investigating prompt engineering in diffusion models. arXiv preprint arXiv:2211.15462, 2022

work page arXiv 2022
[8]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning , pages 8748–8763. PMLR, 2021

work page 2021
[9]

Adding Conditional Control to Text-to-Image Diffusion Models

Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023

work page internal anchor Pith review arXiv 2023
[10]

Prompt-free diffusion: Taking "text" out of text-to-image diffusion models

Xingqian Xu, Jiayi Guo, Zhangyang Wang, Gao Huang, Irfan Essa, and Humphrey Shi. Prompt-free diffusion: Taking" text" out of text-to-image diffusion models. arXiv preprint arXiv:2305.16223, 2023

work page arXiv 2023
[11]

T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023

work page internal anchor Pith review arXiv 2023
[12]

Uni-controlnet: All-in-one control to text-to-image diffusion models

Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, and Kwan-Yee K Wong. Uni-controlnet: All-in-one control to text-to-image diffusion models. arXiv preprint arXiv:2305.16322, 2023

work page arXiv 2023
[13]

Zero-shot text-to-image generation

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea V oss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning , pages 8821–

work page
[14]

Cogview: Mastering text-to-image generation via transformers

Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34:19822–19835, 2021. 14

work page 2021
[15]

Cogview2: Faster and better text-to-image generation via hierarchical transformers

Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. Cogview2: Faster and better text-to-image generation via hierarchical transformers. Advances in Neural Information Processing Systems , 35:16890–16902, 2022

work page 2022
[16]

Make-a-scene: Scene-based text-to-image generation with human priors

Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision, pages 89–106. Springer, 2022

work page 2022
[17]

Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

work page 2017
[18]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems , 30, 2017

work page 2017
[19]

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022

work page internal anchor Pith review arXiv 2022
[20]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015

work page 2015
[21]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[22]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2011
[23]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021

work page 2021
[24]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020

work page 2020
[25]

Re-imagen: Retrieval-augmented text-to-image generator

Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-imagen: Retrieval-augmented text-to- image generator. arXiv preprint arXiv:2209.14491, 2022

work page arXiv 2022
[26]

Versatile diffusion: Text, images and variations all in one diffusion model,

Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, and Humphrey Shi. Versatile diffusion: Text, images and variations all in one diffusion model. arXiv preprint arXiv:2211.08332, 2022

work page arXiv 2022
[27]

Com- poser: Creative and controllable image synthesis with composable conditions.arXiv preprint arXiv:2302.09778, 2023

Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778, 2023

work page arXiv 2023
[28]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Out- rageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[29]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research , 23(1):5232–5270, 2022

work page 2022
[30]

Parameter-efficient transfer learning for nlp

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019

work page 2019
[31]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

work page internal anchor Pith review arXiv 2023
[32]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention.arXiv preprint arXiv:2303.16199, 2023

work page Pith review arXiv 2023
[34]

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xi- angyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model.arXiv preprint arXiv:2304.15010, 2023

work page internal anchor Pith review arXiv 2023
[35]

What matters in training a gpt4-style language model with multi- modal inputs? ArXiv, abs/2307.02469, 2023

Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, and Tao Kong. What matters in training a gpt4-style language model with multimodal inputs? arXiv preprint arXiv:2307.02469, 2023. 15

work page arXiv 2023
[36]

Pseudo numerical methods for diffusion models on manifolds

Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778, 2022

work page arXiv 2022
[37]

Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems , 35:5775–5787, 2022

work page 2022
[38]

DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022

work page internal anchor Pith review arXiv 2022
[39]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[40]

U-net: Convolutional networks for biomedical image seg- mentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image seg- mentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18 , pages 234–241. Springer, 2015

work page 2015
[41]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[42]

Laion-5b: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022

work page 2022
[43]

Coyo-700m: Image-text pair dataset

Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022

work page 2022
[44]

Openclip

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip. https://github.com/mlfoundations/open_clip, 2021

work page 2021
[45]

Diffusers: State-of-the-art diffusion models

Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/ diffusers, 2022

work page 2022
[46]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[47]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13 , pages 740–755. Springer, 2014

work page 2014
[48]

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021

work page internal anchor Pith review arXiv 2021
[49]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning , pages 12888–12900. PMLR, 2022

work page 2022
[50]

SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021

work page internal anchor Pith review arXiv 2021
[51]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022

work page internal anchor Pith review arXiv 2022
[52]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023. 16

work page 2023

[1] [1]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021

work page internal anchor Pith review arXiv 2021

[2] [2]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[3] [3]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems , 35:36479–36494, 2022

work page 2022

[4] [4]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

work page 2022

[5] [5]

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022

work page internal anchor Pith review arXiv 2022

[6] [6]

Raphael: Text-to-image generation via large mixture of diffusion paths

Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, and Ping Luo. Raphael: Text-to- image generation via large mixture of diffusion paths. arXiv preprint arXiv:2305.18295, 2023

work page arXiv 2023

[7] [7]

Investigating prompt engineering in diffusion models

Sam Witteveen and Martin Andrews. Investigating prompt engineering in diffusion models. arXiv preprint arXiv:2211.15462, 2022

work page arXiv 2022

[8] [8]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning , pages 8748–8763. PMLR, 2021

work page 2021

[9] [9]

Adding Conditional Control to Text-to-Image Diffusion Models

Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023

work page internal anchor Pith review arXiv 2023

[10] [10]

Prompt-free diffusion: Taking "text" out of text-to-image diffusion models

Xingqian Xu, Jiayi Guo, Zhangyang Wang, Gao Huang, Irfan Essa, and Humphrey Shi. Prompt-free diffusion: Taking" text" out of text-to-image diffusion models. arXiv preprint arXiv:2305.16223, 2023

work page arXiv 2023

[11] [11]

T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023

work page internal anchor Pith review arXiv 2023

[12] [12]

Uni-controlnet: All-in-one control to text-to-image diffusion models

Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, and Kwan-Yee K Wong. Uni-controlnet: All-in-one control to text-to-image diffusion models. arXiv preprint arXiv:2305.16322, 2023

work page arXiv 2023

[13] [13]

Zero-shot text-to-image generation

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea V oss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning , pages 8821–

work page

[14] [14]

Cogview: Mastering text-to-image generation via transformers

Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34:19822–19835, 2021. 14

work page 2021

[15] [15]

Cogview2: Faster and better text-to-image generation via hierarchical transformers

Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. Cogview2: Faster and better text-to-image generation via hierarchical transformers. Advances in Neural Information Processing Systems , 35:16890–16902, 2022

work page 2022

[16] [16]

Make-a-scene: Scene-based text-to-image generation with human priors

Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision, pages 89–106. Springer, 2022

work page 2022

[17] [17]

Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information processing systems, 30, 2017

work page 2017

[18] [18]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems , 30, 2017

work page 2017

[19] [19]

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022

work page internal anchor Pith review arXiv 2022

[20] [20]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015

work page 2015

[21] [21]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010

[22] [22]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2011

[23] [23]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021

work page 2021

[24] [24]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020

work page 2020

[25] [25]

Re-imagen: Retrieval-augmented text-to-image generator

Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-imagen: Retrieval-augmented text-to- image generator. arXiv preprint arXiv:2209.14491, 2022

work page arXiv 2022

[26] [26]

Versatile diffusion: Text, images and variations all in one diffusion model,

Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, and Humphrey Shi. Versatile diffusion: Text, images and variations all in one diffusion model. arXiv preprint arXiv:2211.08332, 2022

work page arXiv 2022

[27] [27]

Com- poser: Creative and controllable image synthesis with composable conditions.arXiv preprint arXiv:2302.09778, 2023

Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778, 2023

work page arXiv 2023

[28] [28]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Out- rageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[29] [29]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research , 23(1):5232–5270, 2022

work page 2022

[30] [30]

Parameter-efficient transfer learning for nlp

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019

work page 2019

[31] [31]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

work page internal anchor Pith review arXiv 2023

[32] [32]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[33] [33]

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention.arXiv preprint arXiv:2303.16199, 2023

work page Pith review arXiv 2023

[34] [34]

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xi- angyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model.arXiv preprint arXiv:2304.15010, 2023

work page internal anchor Pith review arXiv 2023

[35] [35]

What matters in training a gpt4-style language model with multi- modal inputs? ArXiv, abs/2307.02469, 2023

Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, and Tao Kong. What matters in training a gpt4-style language model with multimodal inputs? arXiv preprint arXiv:2307.02469, 2023. 15

work page arXiv 2023

[36] [36]

Pseudo numerical methods for diffusion models on manifolds

Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778, 2022

work page arXiv 2022

[37] [37]

Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems , 35:5775–5787, 2022

work page 2022

[38] [38]

DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022

work page internal anchor Pith review arXiv 2022

[39] [39]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[40] [40]

U-net: Convolutional networks for biomedical image seg- mentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image seg- mentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18 , pages 234–241. Springer, 2015

work page 2015

[41] [41]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[42] [42]

Laion-5b: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022

work page 2022

[43] [43]

Coyo-700m: Image-text pair dataset

Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022

work page 2022

[44] [44]

Openclip

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip. https://github.com/mlfoundations/open_clip, 2021

work page 2021

[45] [45]

Diffusers: State-of-the-art diffusion models

Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/ diffusers, 2022

work page 2022

[46] [46]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[47] [47]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13 , pages 740–755. Springer, 2014

work page 2014

[48] [48]

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021

work page internal anchor Pith review arXiv 2021

[49] [49]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning , pages 12888–12900. PMLR, 2022

work page 2022

[50] [50]

SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021

work page internal anchor Pith review arXiv 2021

[51] [51]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022

work page internal anchor Pith review arXiv 2022

[52] [52]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023. 16

work page 2023