pith. machine review for the scientific record.

arxiv: 2308.06721 · v1 · submitted 2023-08-13 · 💻 cs.CV · cs.AI

Recognition: 1 theorem link · Lean Theorem

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 23:51 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords image prompt · text-to-image diffusion · adapter · cross-attention · multimodal generation · lightweight model · frozen model · compatible generation

The pith

A 22-million-parameter adapter adds image prompting to frozen text-to-image diffusion models while preserving text compatibility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to equip existing text-to-image diffusion models with image prompt support using a small adapter module instead of full retraining. The central design separates cross-attention so that text features and image features are processed in distinct layers. This keeps the original model frozen and avoids interference between the two prompt types. As a result, the adapter matches or exceeds the output quality of fully fine-tuned image-prompt systems. Readers care because the approach makes image-guided generation practical on standard hardware and compatible with existing text workflows and control tools.

Core claim

IP-Adapter achieves image prompt capability for pretrained text-to-image diffusion models through a decoupled cross-attention mechanism that separates cross-attention layers for text features and image features. With only 22M parameters and the base model frozen, the adapter matches or exceeds the performance of fully fine-tuned image prompt models. The separation enables the image prompt to work together with text prompts, to generalize across custom models derived from the same base, and to combine with existing structural control tools.

What carries the argument

Decoupled cross-attention mechanism that assigns separate cross-attention layers to text features and to image features.
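
A minimal sketch of the mechanism in PyTorch, assuming pre-extracted text features (from the frozen text encoder) and image features (from a frozen CLIP image encoder plus a small projection). Names such as DecoupledCrossAttention and ip_scale are illustrative, not taken from the paper's code; the structural point is that the query projection and the text key/value path stay frozen while only the new image key/value projections are trained.

    # Illustrative sketch, not the authors' implementation.
    from torch import nn

    class DecoupledCrossAttention(nn.Module):
        def __init__(self, dim, text_dim, image_dim, ip_scale=1.0):
            super().__init__()
            # Original projections; weights come from the base UNet and stay frozen.
            self.to_q = nn.Linear(dim, dim, bias=False)
            self.to_k_text = nn.Linear(text_dim, dim, bias=False)
            self.to_v_text = nn.Linear(text_dim, dim, bias=False)
            # New projections for image features: the only trained parameters.
            self.to_k_image = nn.Linear(image_dim, dim, bias=False)
            self.to_v_image = nn.Linear(image_dim, dim, bias=False)
            self.ip_scale = ip_scale  # illustrative knob; 0 recovers the base model

        @staticmethod
        def attention(q, k, v):
            scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
            return scores.softmax(dim=-1) @ v

        def forward(self, hidden_states, text_feats, image_feats):
            q = self.to_q(hidden_states)
            # Text path: the original cross-attention, unchanged.
            text_out = self.attention(
                q, self.to_k_text(text_feats), self.to_v_text(text_feats))
            # Image path: separate keys/values, same queries.
            image_out = self.attention(
                q, self.to_k_image(image_feats), self.to_v_image(image_feats))
            # Summing the outputs keeps the two prompt types in distinct layers.
            return text_out + self.ip_scale * image_out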

If this is right

  • The image prompt combines directly with the text prompt to produce multimodal outputs (illustrated in the sketch after this list).
  • The same adapter applies without modification to other models fine-tuned from the identical base model.
  • Existing structural control tools remain usable alongside the image prompt.
  • Image prompting becomes available without retraining the base model or using large compute resources.
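
Recent releases of the diffusers library ship an IP-Adapter integration that makes this portability concrete. A usage sketch under that assumption: the checkpoint names below are the publicly released ones, but the exact API should be verified against the installed diffusers version.

    # Hedged usage sketch based on the diffusers IP-Adapter integration,
    # which postdates this paper.
    import torch
    from diffusers import StableDiffusionPipeline
    from diffusers.utils import load_image

    # Any model fine-tuned from the same SD 1.5 base should also work here.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    pipe.load_ip_adapter(
        "h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin"
    )
    pipe.set_ip_adapter_scale(0.6)  # balance image guidance against the text prompt

    reference = load_image("statue.png")  # hypothetical reference image
    out = pipe(
        prompt="wearing sunglasses, on the beach",  # text and image prompts combine
        ip_adapter_image=reference,
        num_inference_steps=50,
    ).images[0]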

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same separation idea could be tested for adding other non-text signals such as depth maps or sketches.
  • Lightweight adapters of this type might reduce the overall need for repeated full-model fine-tuning across the field.
  • Users could experiment with mixing multiple reference images more freely once text and image pathways stay independent.

Load-bearing premise

Separating cross-attention layers for text and image features lets the adapter add image guidance without lowering the quality of the frozen model's original text-to-image generation.
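
In equation form, the premise is that each modified attention layer computes the original text term plus a parallel image term with its own key/value projections, so zeroing the image term reproduces the unmodified model exactly. A sketch consistent with the paper's description, where Q comes from the frozen UNet, (K, V) from text features, and (K', V') from the trained adapter projections of image features:

    Z_{\text{new}}
      = \underbrace{\operatorname{Softmax}\!\Big(\tfrac{Q K^{\top}}{\sqrt{d}}\Big)\, V}_{\text{frozen text path}}
      + \underbrace{\operatorname{Softmax}\!\Big(\tfrac{Q (K')^{\top}}{\sqrt{d}}\Big)\, V'}_{\text{trained image path}}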

What would settle it

A head-to-head test on standard image-prompt benchmarks where the 22M adapter produces measurably lower fidelity or weaker prompt adherence than a fully fine-tuned image-prompt baseline.
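
A minimal sketch of how such a head-to-head could be scored, assuming folders of generated outputs with their reference images and prompts already exist. The CLIP-I (reference fidelity) and CLIP-T (prompt adherence) names follow common evaluation practice for image-prompt models; the checkpoint and helper names here are assumptions, not the paper's exact protocol.

    # Hedged sketch: scores one (generated, reference, prompt) triple with CLIP.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    @torch.no_grad()
    def clip_i(generated_path, reference_path):
        # CLIP-I: cosine similarity between generated and reference image embeddings.
        images = [Image.open(generated_path), Image.open(reference_path)]
        feats = model.get_image_features(**processor(images=images, return_tensors="pt"))
        feats = feats / feats.norm(dim=-1, keepdim=True)
        return (feats[0] @ feats[1]).item()

    @torch.no_grad()
    def clip_t(generated_path, prompt):
        # CLIP-T: cosine similarity between the generated image and the text prompt.
        img = model.get_image_features(
            **processor(images=Image.open(generated_path), return_tensors="pt"))
        txt = model.get_text_features(
            **processor(text=[prompt], return_tensors="pt", padding=True))
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return (img[0] @ txt[0]).item()

Averaged over a shared benchmark set, a consistent deficit on either score against the fully fine-tuned baseline would count against the claim; parity or better would support it.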

Original abstract

Recent years have witnessed the strong power of large text-to-image diffusion models for the impressive generative capability to create high-fidelity images. However, it is very tricky to generate desired images using only text prompt as it often involves complex prompt engineering. An alternative to text prompt is image prompt, as the saying goes: "an image is worth a thousand words". Although existing methods of direct fine-tuning from pretrained models are effective, they require large computing resources and are not compatible with other base models, text prompt, and structural controls. In this paper, we present IP-Adapter, an effective and lightweight adapter to achieve image prompt capability for the pretrained text-to-image diffusion models. The key design of our IP-Adapter is decoupled cross-attention mechanism that separates cross-attention layers for text features and image features. Despite the simplicity of our method, an IP-Adapter with only 22M parameters can achieve comparable or even better performance to a fully fine-tuned image prompt model. As we freeze the pretrained diffusion model, the proposed IP-Adapter can be generalized not only to other custom models fine-tuned from the same base model, but also to controllable generation using existing controllable tools. With the benefit of the decoupled cross-attention strategy, the image prompt can also work well with the text prompt to achieve multimodal image generation. The project page is available at https://ip-adapter.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces IP-Adapter, a lightweight (22M-parameter) adapter module for pretrained text-to-image diffusion models. It employs a decoupled cross-attention mechanism that routes image features through separate layers while keeping the base UNet frozen, enabling image prompting that is claimed to match or exceed fully fine-tuned image-prompt models. The design is presented as text-compatible, generalizable to other fine-tuned base models and controllable tools, and supportive of multimodal (text+image) generation.

Significance. If the empirical claims hold, the work is significant for demonstrating an efficient, plug-and-play adapter that avoids full-model fine-tuning while preserving compatibility with text conditioning and existing control mechanisms. The parameter efficiency and frozen-base strategy represent a practical contribution that could reduce compute barriers in customizing diffusion models. The project page link aids reproducibility and community adoption.

major comments (2)
  1. [Experiments] The central claim of text compatibility and non-interference with the frozen pretrained model rests on the decoupled cross-attention design. The manuscript does not report quantitative before/after comparisons (e.g., FID or CLIP score on standard text-only benchmarks such as COCO with image input disabled) after inserting and training the adapter layers, leaving the no-degradation assumption unverified in the experimental evaluation.
  2. [§4, results] The statement that the 22M-parameter IP-Adapter achieves 'comparable or even better performance' than fully fine-tuned image-prompt models requires explicit specification of the exact baselines, training compute, datasets, and metrics (including variance across runs) used in the comparison tables to substantiate the claim.
minor comments (2)
  1. The abstract would be strengthened by including one or two key quantitative results (e.g., specific FID or user-study scores) rather than qualitative claims alone.
  2. [Method] Notation for the decoupled cross-attention (e.g., separate Q/K/V projections for image vs. text) should be formalized with an equation in the method section for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. The points raised highlight opportunities to strengthen the experimental validation of our text-compatibility claims and the performance comparisons. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Experiments] The central claim of text compatibility and non-interference with the frozen pretrained model rests on the decoupled cross-attention design. The manuscript does not report quantitative before/after comparisons (e.g., FID or CLIP score on standard text-only benchmarks such as COCO with image input disabled) after inserting and training the adapter layers, leaving the no-degradation assumption unverified in the experimental evaluation.

    Authors: We agree that explicit quantitative verification would strengthen the central claim. Although the base UNet remains frozen and the cross-attention is decoupled by design, we will add before/after experiments in the revised manuscript: we will report FID and CLIP scores on COCO for the original pretrained model versus the model with the trained IP-Adapter (image input disabled) to directly confirm no degradation in text-only performance. revision: yes

  2. Referee: [§4, results] The statement that the 22M-parameter IP-Adapter achieves 'comparable or even better performance' than fully fine-tuned image-prompt models requires explicit specification of the exact baselines, training compute, datasets, and metrics (including variance across runs) used in the comparison tables to substantiate the claim.

    Authors: We will revise §4 to explicitly enumerate the baselines (specific fine-tuned models and their training details), datasets, training compute, and metrics for each entry in the comparison tables. For variance, we will clarify that results are from single runs with fixed random seeds (standard practice when multiple runs are not reported) or add multi-run statistics if additional compute is feasible; this will provide full context for the 'comparable or better' statement. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical architectural proposal without derivations

Full rationale

The paper presents IP-Adapter as a lightweight empirical design using decoupled cross-attention layers inserted into a frozen pretrained UNet, with no equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central claim to its own inputs. The 22M-parameter adapter's performance claims rest on experimental comparisons rather than any self-referential mathematical construction or uniqueness theorem imported from prior author work. The approach is self-contained as a practical engineering choice for text-compatible image prompting, with no detectable reduction of outputs to inputs by definition or construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The contribution is an empirical adapter architecture with no mathematical derivations, axioms, or free parameters described in the abstract.

invented entities (1)
  • IP-Adapter (no independent evidence)
    purpose: Lightweight module to enable image prompting via decoupled cross-attention
    The adapter is the core proposed artifact; no independent evidence or external validation is provided in the abstract.

pith-pipeline@v0.9.0 · 5558 in / 1158 out tokens · 47223 ms · 2026-05-10T23:51:27.281343+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Support-Conditioned Flow Matching Is Kernel Smoothing

    cs.LG 2026-05 accept novelty 8.0

    Support-conditioned flow matching under the Gaussian OT path is exactly Nadaraya-Watson kernel smoothing with time-decreasing bandwidth, implemented by a single Gaussian attention head.

  2. DirectTryOn: One-Step Virtual Try-On via Straightened Conditional Transport

    cs.CV 2026-05 unverdicted novelty 7.0

    DirectTryOn achieves state-of-the-art one-step virtual try-on performance by applying pure conditional transport, garment preservation loss, and self-consistency loss to straighten trajectories in pretrained generativ...

  3. Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

  4. MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics

    cs.CV 2026-05 unverdicted novelty 7.0

    MoCam uses structured denoising dynamics in diffusion models to temporally decouple geometric alignment from appearance refinement, enabling unified novel view synthesis that outperforms prior methods on imperfect poi...

  5. MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics

    cs.CV 2026-05 unverdicted novelty 7.0

    MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.

  6. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.

  7. Follow the Mean: Reference-Guided Flow Matching

    cs.LG 2026-05 unverdicted novelty 7.0

    Flow matching admits controllable generation by shifting the conditional endpoint mean computed from a reference set, enabling training-free guidance on frozen pretrained models.

  8. Follow the Mean: Reference-Guided Flow Matching

    cs.LG 2026-05 unverdicted novelty 7.0

    Flow matching admits reference-guided control by shifting the conditional endpoint mean, enabling training-free steering of models like FLUX via example banks and a semi-parametric variant on DiT.

  9. Detecting Deception, Not Deepfakes: Why Media Forensics Needs Social Theories

    cs.CY 2026-05 unverdicted novelty 7.0

    Deepfake detection must shift from classifying media realism to detecting communicative deception by applying Speech Act Theory, Grice's Cooperative Principle, and Cialdini's influence principles.

  10. Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision

    cs.CV 2026-05 unverdicted novelty 7.0

    Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.

  11. Adaptive Subspace Projection for Generative Personalization

    cs.CV 2026-05 unverdicted novelty 7.0

    A training-free adaptive subspace projection method mitigates semantic collapsing in generative personalization by isolating and adjusting drift in a low-dimensional subspace using the stable pre-trained embedding as anchor.

  12. A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping

    cs.CV 2026-05 unverdicted novelty 7.0

    Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and...

  13. CA-IDD: Cross-Attention Guided Identity-Conditional Diffusion for Identity-Consistent Face Swapping

    cs.CV 2026-04 unverdicted novelty 7.0

    CA-IDD is the first diffusion model for face swapping that integrates multi-modal cross-attention guidance from identity embeddings, gaze, and facial parsing to achieve better identity consistency and an FID of 11.73 ...

  14. MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.

  15. MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    MuSS is a movie-derived dataset and benchmark that enables AI models to generate multi-shot videos with coherent narratives and preserved subject identity across shots.

  16. StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition

    cs.GR 2026-04 unverdicted novelty 7.0

    StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.

  17. AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe

    cs.MM 2026-04 unverdicted novelty 7.0

    AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both a...

  18. View-Consistent 3D Scene Editing via Dual-Path Structural Correspondense and Semantic Continuity

    cs.CV 2026-04 unverdicted novelty 7.0

    A dual-path consistency framework for text-driven 3D scene editing that models cross-view dependencies via structural correspondence and semantic continuity, trained on a newly constructed paired multi-view dataset.

  19. ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding

    cs.CV 2026-04 unverdicted novelty 7.0

    ASTRA disentangles subject identity from pose structure in diffusion transformers via retrieval-augmented pose guidance, asymmetric EURoPE embeddings, and a DSM adapter to improve multi-subject generation.

  20. Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

    cs.CV 2026-04 unverdicted novelty 7.0

    A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.

  21. UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation

    cs.CV 2026-04 conditional novelty 7.0

    UDAPose improves low-light human pose estimation by synthesizing realistic images via DHF and LCIM modules and dynamically balancing image cues with pose priors using DCA, yielding AP gains of 10.1 and 7.4 over prior methods.

  22. NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity

    cs.LG 2026-04 unverdicted novelty 7.0

    NeuroFlow is the first unified flow model for bidirectional visual encoding and decoding from neural activity using NeuroVAE and cross-modal flow matching.

  23. Large-Scale Universal Defect Generation: Foundation Models and Datasets

    cs.CV 2026-04 unverdicted novelty 7.0

    A 300K quadruplet dataset and UniDG foundation model enable reference- or text-driven defect generation across categories, outperforming few-shot baselines on anomaly detection tasks.

  24. MoZoo: Unleashing Video Diffusion power in animal fur and muscle simulation

    cs.GR 2026-04 unverdicted novelty 7.0

    MoZoo generates high-fidelity animal videos with fur and muscle dynamics from coarse meshes by extending video diffusion with role-aware RoPE and asymmetric decoupled attention, trained on a new synthetic-to-real dataset.

  25. Graph-PiT: Enhancing Structural Coherence in Part-Based Image Synthesis via Graph Priors

    cs.CV 2026-04 unverdicted novelty 7.0

    Graph-PiT adds graph priors and a hierarchical GNN to part-based image synthesis to enforce relational constraints and improve structural coherence over vanilla PiT.

  26. ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    cs.CV 2024-03 unverdicted novelty 7.0

    ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.

  27. CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis

    cs.CV 2026-05 unverdicted novelty 6.0

    CRAFT adapts diffusion models to medical images via clinical reward alignment from LLMs and VLMs, improving alignment scores and cutting low-quality generations by 20.4% on average across modalities.

  28. Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

    cs.CV 2026-05 unverdicted novelty 6.0

    V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...

  29. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.

  30. L2P: Unlocking Latent Potential for Pixel Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

  31. Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition

    cs.CV 2026-05 unverdicted novelty 6.0

    Fashion130K dataset and UMC framework align text and visual prompts with embedding refiner, Fusion Transformer, and redesigned attention to generate more consistent outfits than prior methods.

  32. Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition

    cs.CV 2026-05 unverdicted novelty 6.0

    Fashion130K dataset and UMC framework align text and visual prompts to generate more consistent fashion outfits than prior state-of-the-art methods.

  33. Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    HFRU is a two-stage reinforcement unlearning method operating on the vision encoder with GRPO optimization and an abstraction reward that achieves over 98% forgetting and retention on object and face tasks with neglig...

  34. From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data

    cs.CV 2026-05 unverdicted novelty 6.0

    The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world perfor...

  35. Decoupling Semantics and Fingerprints: A Universal Representation for AI-Generated Image Detection

    cs.CV 2026-05 unverdicted novelty 6.0

    ODP-Net structurally disentangles universal forgery traces from generator fingerprints and semantics via orthogonal decomposition and purification, delivering state-of-the-art generalization to unseen AI image generat...

  36. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

  37. Advancing Aesthetic Image Generation via Composition Transfer

    cs.CV 2026-05 unverdicted novelty 6.0

    Composer enables semantic-agnostic composition transfer from references and theme-driven planning via LVLMs to improve aesthetic quality in diffusion-based image generation.

  38. Structured 3D Latents Are Surprisingly Powerful: Unleashing Generalizable Style with 2D Diffusion

    cs.CV 2026-05 unverdicted novelty 6.0

    DiLAST optimizes 3D latents via guidance from a 2D diffusion model to enable generalizable style transfer for OOD styles in 3D asset generation.

  39. MooD: Perception-Enhanced Efficient Affective Image Editing via Continuous Valence-Arousal Modeling

    cs.CV 2026-05 unverdicted novelty 6.0

    MooD is the first framework to use continuous Valence-Arousal values for fine-grained affective image editing via a VA-aware retrieval strategy, visual transfer, semantic guidance, and the new AffectSet dataset.

  40. MooD: Perception-Enhanced Efficient Affective Image Editing via Continuous Valence-Arousal Modeling

    cs.CV 2026-05 unverdicted novelty 6.0

    MooD introduces continuous valence-arousal modeling with VA-aware retrieval and perception-enhanced guidance for efficient, controllable affective image editing, plus a new AffectSet dataset.

  41. Disentangled Anatomy-Disease Diffusion (DADD) for Controllable Ulcerative Colitis Progression Synthesis

    cs.CV 2026-05 unverdicted novelty 6.0

    DADD disentangles anatomy and disease in a latent diffusion model using a Feature Purifier, ordinal disease embeddings, and Delta Steering to synthesize controllable ulcerative colitis progression images.

  42. Map2World: Segment Map Conditioned Text to 3D World Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Map2World produces scale-consistent 3D worlds from text and arbitrary segment maps via a detail enhancer that incorporates global structure information.

  43. Leveraging Verifier-Based Reinforcement Learning in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.

  44. ViPO: Visual Preference Optimization at Scale

    cs.CV 2026-04 unverdicted novelty 6.0

    Poly-DPO improves robustness to noisy preference data in visual models, and the new ViPO dataset enables superior performance, with the method reducing to standard DPO on high-quality data.

  45. DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior

    cs.CV 2026-04 unverdicted novelty 6.0

    DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.

  46. PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios

    cs.CV 2026-04 unverdicted novelty 6.0

    PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional ...

  47. ReContraster: Making Your Posters Stand Out with Regional Contrast

    cs.CV 2026-04 unverdicted novelty 6.0

    ReContraster is the first training-free model that applies regional contrast via a compositional multi-agent system and hybrid denoising during diffusion to generate visually striking posters, supported by a new bench...

  48. SIC3D: Style Image Conditioned Text-to-3D Gaussian Splatting Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    SIC3D generates text-to-3D objects with Gaussian splatting then stylizes them using Variational Stylized Score Distillation loss plus scaling regularization to improve style match and geometry fidelity.

  49. What Matters in Virtual Try-Off? Dual-UNet Diffusion Model For Garment Reconstruction

    cs.CV 2026-04 accept novelty 6.0

    A Dual-UNet diffusion model for virtual garment reconstruction from clothed images sets new benchmarks on VITON-HD and DressCode by optimizing Stable Diffusion variants, mask conditioning, and auxiliary losses.

  50. Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding

    cs.LG 2026-04 unverdicted novelty 6.0

    A meta-optimized in-context learning approach enables training-free cross-subject semantic visual decoding from fMRI by inferring individual neural encoding patterns via hierarchical inference on a few examples.

  51. Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.

  52. Multimodal Large Language Models for Multi-Subject In-Context Image Generation

    cs.LG 2026-04 unverdicted novelty 6.0

    MUSIC is the first MLLM for multi-subject in-context image generation that uses an automatic data pipeline, vision chain-of-thought reasoning, and semantics-driven spatial layout planning to outperform prior methods o...

  53. VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.

  54. SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.

  55. StoryBlender: Inter-Shot Consistent and Editable 3D Storyboard with Spatial-temporal Dynamics

    cs.CV 2026-04 unverdicted novelty 6.0

    StoryBlender generates inter-shot consistent editable 3D storyboards using a three-stage pipeline of semantic-spatial grounding, canonical asset materialization, and spatial-temporal dynamics with agent-based verification.

  56. MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model

    cs.CV 2026-03 unverdicted novelty 6.0

    MPDiT uses a hierarchical multi-patch design in transformers to lower computation in diffusion models by handling coarse global features first then fine local details, plus faster-converging embeddings.

  57. Premier: Personalized Preference Modulation with Learnable User Embedding in Text-to-Image Generation

    cs.CV 2026-03 unverdicted novelty 6.0

    Premier learns user-specific embeddings to modulate text-to-image generation, outperforming prior methods on preference alignment, text consistency, and expert ratings even with limited history.

  58. FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    cs.GR 2025-06 unverdicted novelty 6.0

    FLUX.1 Kontext unifies image generation and editing via flow matching and sequence concatenation, delivering improved multi-turn consistency and speed on the new KontextBench benchmark.

  59. ImgEdit: A Unified Image Editing Dataset and Benchmark

    cs.CV 2025-05 conditional novelty 6.0

    ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.

  60. CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 70 Pith papers · 16 internal anchors

  1. [1]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021

  2. [2]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

  3. [3]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022

  4. [4]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  5. [5]

    eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022

  6. [6]

    Raphael: Text-to-image generation via large mixture of diffusion paths

    Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, and Ping Luo. Raphael: Text-to-image generation via large mixture of diffusion paths. arXiv preprint arXiv:2305.18295, 2023

  7. [7]

    Investigating prompt engineering in diffusion models

    Sam Witteveen and Martin Andrews. Investigating prompt engineering in diffusion models. arXiv preprint arXiv:2211.15462, 2022

  8. [8]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning , pages 8748–8763. PMLR, 2021

  9. [9]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023

  10. [10]

    Prompt-free diffusion: Taking "text" out of text-to-image diffusion models

    Xingqian Xu, Jiayi Guo, Zhangyang Wang, Gao Huang, Irfan Essa, and Humphrey Shi. Prompt-free diffusion: Taking "text" out of text-to-image diffusion models. arXiv preprint arXiv:2305.16223, 2023

  11. [11]

    T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

    Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023

  12. [12]

    Uni-controlnet: All-in-one control to text-to-image diffusion models

    Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, and Kwan-Yee K Wong. Uni-controlnet: All-in-one control to text-to-image diffusion models. arXiv preprint arXiv:2305.16322, 2023

  13. [13]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021

  14. [14]

    Cogview: Mastering text-to-image generation via transformers

    Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34:19822–19835, 2021

  15. [15]

    Cogview2: Faster and better text-to-image generation via hierarchical transformers

    Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. Cogview2: Faster and better text-to-image generation via hierarchical transformers. Advances in Neural Information Processing Systems, 35:16890–16902, 2022

  16. [16]

    Make-a-scene: Scene-based text-to-image generation with human priors

    Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision, pages 89–106. Springer, 2022

  17. [17]

    Neural discrete representation learning

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017

  18. [18]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  19. [19]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022

  20. [20]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015

  21. [21]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

  22. [22]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020

  23. [23]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021

  24. [24]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020

  25. [25]

    Re-imagen: Retrieval-augmented text-to-image generator

    Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491, 2022

  26. [26]

    Versatile diffusion: Text, images and variations all in one diffusion model

    Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, and Humphrey Shi. Versatile diffusion: Text, images and variations all in one diffusion model. arXiv preprint arXiv:2211.08332, 2022

  27. [27]

    Composer: Creative and controllable image synthesis with composable conditions

    Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778, 2023

  28. [28]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

  29. [29]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research, 23(1):5232–5270, 2022

  30. [30]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019

  31. [31]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

  32. [32]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

  33. [33]

    LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023

  34. [34]

    LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023

  35. [35]

    What matters in training a gpt4-style language model with multimodal inputs?

    Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, and Tao Kong. What matters in training a gpt4-style language model with multimodal inputs? arXiv preprint arXiv:2307.02469, 2023

  36. [36]

    Pseudo numerical methods for diffusion models on manifolds

    Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778, 2022

  37. [37]

    Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022

  38. [38]

    Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022

  39. [39]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  40. [40]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015

  41. [41]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016

  42. [42]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022

  43. [43]

    Coyo-700m: Image-text pair dataset

    Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022

  44. [44]

    Openclip

    Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip. https://github.com/mlfoundations/open_clip, 2021

  45. [45]

    Diffusers: State-of-the-art diffusion models

    Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022

  46. [46]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  47. [47]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014

  48. [48]

    CLIPScore: A Reference-free Evaluation Metric for Image Captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021

  49. [49]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022

  50. [50]

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021

  51. [51]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022

  52. [52]

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023