Support-conditioned flow matching under the Gaussian OT path is exactly Nadaraya-Watson kernel smoothing with time-decreasing bandwidth, implemented by a single Gaussian attention head.
super hub Canonical reference
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
Canonical reference. 90% of citing Pith papers cite this work as background.
abstract
Recent years have witnessed the strong power of large text-to-image diffusion models for the impressive generative capability to create high-fidelity images. However, it is very tricky to generate desired images using only text prompt as it often involves complex prompt engineering. An alternative to text prompt is image prompt, as the saying goes: "an image is worth a thousand words". Although existing methods of direct fine-tuning from pretrained models are effective, they require large computing resources and are not compatible with other base models, text prompt, and structural controls. In this paper, we present IP-Adapter, an effective and lightweight adapter to achieve image prompt capability for the pretrained text-to-image diffusion models. The key design of our IP-Adapter is decoupled cross-attention mechanism that separates cross-attention layers for text features and image features. Despite the simplicity of our method, an IP-Adapter with only 22M parameters can achieve comparable or even better performance to a fully fine-tuned image prompt model. As we freeze the pretrained diffusion model, the proposed IP-Adapter can be generalized not only to other custom models fine-tuned from the same base model, but also to controllable generation using existing controllable tools. With the benefit of the decoupled cross-attention strategy, the image prompt can also work well with the text prompt to achieve multimodal image generation. The project page is available at \url{https://ip-adapter.github.io}.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Recent years have witnessed the strong power of large text-to-image diffusion models for the impressive generative capability to create high-fidelity images. However, it is very tricky to generate desired images using only text prompt as it often involves complex prompt engineering. An alternative to text prompt is image prompt, as the saying goes: "an image is worth a thousand words". Although existing methods of direct fine-tuning from pretrained models are effective, they require large computing resources and are not compatible with other base models, text prompt, and structural controls. I
authors
co-cited works
representative citing papers
SplatShot is a training-free method that inserts per-step 3DGS refitting and photometric feedback into diffusion denoising to enforce multi-view consistency for single-photo 3D face avatars.
Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.
DEMON is a streaming diffusion engine that exposes denoising parameters as playable controls at up to 12.3 decoder completions per second via per-slot scheduling, shared state, source blending, and accelerated decoding.
EM-Vid introduces an entity-centric latent patch memory bank with sparse token conditioning and budgeted updates for training-free consistent multi-shot video generation.
PIU suppresses target identity generation in Arc2Face by replacing it with a proximity-selected anchor identity through localized fine-tuning of cross-attention layers while preserving output quality for other identities.
Tiny-Engram uses small n-gram-indexed memory tables to bind trigger phrases to target visual identities in diffusion models while preserving compositional control from the surrounding prompt.
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
DirectTryOn achieves state-of-the-art one-step virtual try-on performance by applying pure conditional transport, garment preservation loss, and self-consistency loss to straighten trajectories in pretrained generative models.
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.
Deepfake detection must shift from classifying media realism to detecting communicative deception by applying Speech Act Theory, Grice's Cooperative Principle, and Cialdini's influence principles.
Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.
A training-free adaptive subspace projection method mitigates semantic collapsing in generative personalization by isolating and adjusting drift in a low-dimensional subspace using the stable pre-trained embedding as anchor.
Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and learning-based methods including a proposed diffusion-based V-cache.
CA-IDD is the first diffusion model for face swapping that integrates multi-modal cross-attention guidance from identity embeddings, gaze, and facial parsing to achieve better identity consistency and an FID of 11.73 over GAN baselines.
MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.
AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both an XAI probe and creative tool.
A dual-path consistency framework for text-driven 3D scene editing that models cross-view dependencies via structural correspondence and semantic continuity, trained on a newly constructed paired multi-view dataset.
ASTRA disentangles subject identity from pose structure in diffusion transformers via retrieval-augmented pose guidance, asymmetric EURoPE embeddings, and a DSM adapter to improve multi-subject generation.
A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.
UDAPose improves low-light human pose estimation by synthesizing realistic images via DHF and LCIM modules and dynamically balancing image cues with pose priors using DCA, yielding AP gains of 10.1 and 7.4 over prior methods.
NeuroFlow is the first unified flow model for bidirectional visual encoding and decoding from neural activity using NeuroVAE and cross-modal flow matching.
citing papers explorer
-
PhyDrawGen: Physically Grounded Diagram Generation from Natural Language
PhyDrawGen is a neuro-symbolic pipeline that extracts typed scene graphs via LLM, converts them to physically constrained PSLGs via deterministic solver, and refines via fine-tuned Qwen-VL, claiming superior performance over GPT-5-image and Gemini models on 1,449 physics problems.