super hub Canonical reference

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye, Jun Zhang, Sibo Liu, Wei Yang, Xiao Han · 2023 · cs.CV · arXiv 2308.06721

Canonical reference. 90% of citing Pith papers cite this work as background.

148 Pith papers citing it

Background 90% of classified citations

open full Pith review browse 148 citing papers more from Hu Ye arXiv PDF

abstract

Recent years have witnessed the strong power of large text-to-image diffusion models for the impressive generative capability to create high-fidelity images. However, it is very tricky to generate desired images using only text prompt as it often involves complex prompt engineering. An alternative to text prompt is image prompt, as the saying goes: "an image is worth a thousand words". Although existing methods of direct fine-tuning from pretrained models are effective, they require large computing resources and are not compatible with other base models, text prompt, and structural controls. In this paper, we present IP-Adapter, an effective and lightweight adapter to achieve image prompt capability for the pretrained text-to-image diffusion models. The key design of our IP-Adapter is decoupled cross-attention mechanism that separates cross-attention layers for text features and image features. Despite the simplicity of our method, an IP-Adapter with only 22M parameters can achieve comparable or even better performance to a fully fine-tuned image prompt model. As we freeze the pretrained diffusion model, the proposed IP-Adapter can be generalized not only to other custom models fine-tuned from the same base model, but also to controllable generation using existing controllable tools. With the benefit of the decoupled cross-attention strategy, the image prompt can also work well with the text prompt to achieve multimodal image generation. The project page is available at \url{https://ip-adapter.github.io}.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 26 dataset 3 method 1

citation-polarity summary

background 27 use dataset 2 use method 1

claims ledger

abstract Recent years have witnessed the strong power of large text-to-image diffusion models for the impressive generative capability to create high-fidelity images. However, it is very tricky to generate desired images using only text prompt as it often involves complex prompt engineering. An alternative to text prompt is image prompt, as the saying goes: "an image is worth a thousand words". Although existing methods of direct fine-tuning from pretrained models are effective, they require large computing resources and are not compatible with other base models, text prompt, and structural controls. I

authors

Hu Ye Jun Zhang Sibo Liu Wei Yang Xiao Han

co-cited works

representative citing papers

Support-Conditioned Flow Matching Is Kernel Smoothing

cs.LG · 2026-05-13 · accept · novelty 8.0

Support-conditioned flow matching under the Gaussian OT path is exactly Nadaraya-Watson kernel smoothing with time-decreasing bandwidth, implemented by a single Gaussian attention head.

Splatshot: 3D Face Avatar Generation from a Single Unconstrained Photo

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

SplatShot is a training-free method that inserts per-step 3DGS refitting and photometric feedback into diffusion denoising to enforce multi-view consistency for single-photo 3D face avatars.

Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.

DEMON: Diffusion Engine for Musical Orchestrated Noise

cs.SD · 2026-05-27 · unverdicted · novelty 7.0

DEMON is a streaming diffusion engine that exposes denoising parameters as playable controls at up to 12.3 decoder completions per second via per-slot scheduling, shared state, source blending, and accelerated decoding.

Loki: Representation over Architecture for Diffusion-Based Portrait Animation

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

Loki replaces RGB conditioning stacks with identity-orthogonal parametric face encodings rasterized for diffusion, achieving efficient cross-ID portrait animation without cross-ID training data.

EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

EM-Vid introduces an entity-centric latent patch memory bank with sparse token conditioning and budgeted updates for training-free consistent multi-shot video generation.

PIU: Proximity-guided Identity Unlearning in ID-Conditioned Diffusion Models

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

PIU suppresses target identity generation in Arc2Face by replacing it with a proximity-selected anchor identity through localized fine-tuning of cross-attention layers while preserving output quality for other identities.

Tiny-Engram: Trigger-Indexed Concept Tables for Generative Vision

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Tiny-Engram uses small n-gram-indexed memory tables to bind trigger phrases to target visual identities in diffusion models while preserving compositional control from the surrounding prompt.

Functionalization via Structure Completion and Motion Rectification

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.

DirectTryOn: One-Step Virtual Try-On via Straightened Conditional Transport

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

DirectTryOn achieves state-of-the-art one-step virtual try-on performance by applying pure conditional transport, garment preservation loss, and self-consistency loss to straighten trajectories in pretrained generative models.

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Proposes V2V-Zero, a training-free framework replacing text conditioning with VLM final-layer hidden states from visual pages, achieving 0.85 on GenEval and 32.7/100 on new Simple-V2V Bench across models including video extension.

MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.

Detecting Deception, Not Deepfakes: Why Media Forensics Needs Social Theories

cs.CY · 2026-05-09 · unverdicted · novelty 7.0

Deepfake detection must shift from classifying media realism to detecting communicative deception by applying Speech Act Theory, Grice's Cooperative Principle, and Cialdini's influence principles.

Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.

Adaptive Subspace Projection for Generative Personalization

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

A training-free adaptive subspace projection method mitigates semantic collapsing in generative personalization by isolating and adjusting drift in a low-dimensional subspace using the stable pre-trained embedding as anchor.

A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping

cs.CV · 2026-05-06 · unverdicted · novelty 7.0

Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and learning-based methods including a proposed diffusion-based V-cache.

CA-IDD: Cross-Attention Guided Identity-Conditional Diffusion for Identity-Consistent Face Swapping

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

CA-IDD is the first diffusion model for face swapping that integrates multi-modal cross-attention guidance from identity embeddings, gaze, and facial parsing to achieve better identity consistency and an FID of 11.73 over GAN baselines.

MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

cs.CV · 2026-04-26 · unverdicted · novelty 7.0 · 2 refs

MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.

StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition

cs.GR · 2026-04-23 · unverdicted · novelty 7.0

StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.

AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe

cs.MM · 2026-04-22 · unverdicted · novelty 7.0

AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both an XAI probe and creative tool.

View-Consistent 3D Scene Editing via Dual-Path Structural Correspondense and Semantic Continuity

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

A dual-path consistency framework for text-driven 3D scene editing that models cross-view dependencies via structural correspondence and semantic continuity, trained on a newly constructed paired multi-view dataset.

ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

ASTRA disentangles subject identity from pose structure in diffusion transformers via retrieval-augmented pose guidance, asymmetric EURoPE embeddings, and a DSM adapter to improve multi-subject generation.

Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.

citing papers explorer

Showing 50 of 109 citing papers after filters.

Support-Conditioned Flow Matching Is Kernel Smoothing cs.LG · 2026-05-13 · accept · none · ref 27 · internal anchor
Support-conditioned flow matching under the Gaussian OT path is exactly Nadaraya-Watson kernel smoothing with time-decreasing bandwidth, implemented by a single Gaussian attention head.
Splatshot: 3D Face Avatar Generation from a Single Unconstrained Photo cs.CV · 2026-05-31 · unverdicted · none · ref 17 · internal anchor
SplatShot is a training-free method that inserts per-step 3DGS refitting and photometric feedback into diffusion denoising to enforce multi-view consistency for single-photo 3D face avatars.
Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing cs.CV · 2026-05-31 · unverdicted · none · ref 15 · internal anchor
Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.
DEMON: Diffusion Engine for Musical Orchestrated Noise cs.SD · 2026-05-27 · unverdicted · none · ref 17 · internal anchor
DEMON is a streaming diffusion engine that exposes denoising parameters as playable controls at up to 12.3 decoder completions per second via per-slot scheduling, shared state, source blending, and accelerated decoding.
Loki: Representation over Architecture for Diffusion-Based Portrait Animation cs.CV · 2026-05-22 · unverdicted · none · ref 30 · internal anchor
Loki replaces RGB conditioning stacks with identity-orthogonal parametric face encodings rasterized for diffusion, achieving efficient cross-ID portrait animation without cross-ID training data.
EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation cs.CV · 2026-05-22 · unverdicted · none · ref 20 · internal anchor
EM-Vid introduces an entity-centric latent patch memory bank with sparse token conditioning and budgeted updates for training-free consistent multi-shot video generation.
PIU: Proximity-guided Identity Unlearning in ID-Conditioned Diffusion Models cs.CV · 2026-05-21 · unverdicted · none · ref 47 · internal anchor
PIU suppresses target identity generation in Arc2Face by replacing it with a proximity-selected anchor identity through localized fine-tuning of cross-attention layers while preserving output quality for other identities.
Tiny-Engram: Trigger-Indexed Concept Tables for Generative Vision cs.CV · 2026-05-19 · unverdicted · none · ref 30 · internal anchor
Tiny-Engram uses small n-gram-indexed memory tables to bind trigger phrases to target visual identities in diffusion models while preserving compositional control from the surrounding prompt.
Functionalization via Structure Completion and Motion Rectification cs.CV · 2026-05-18 · unverdicted · none · ref 69 · internal anchor
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
DirectTryOn: One-Step Virtual Try-On via Straightened Conditional Transport cs.CV · 2026-05-13 · unverdicted · none · ref 42 · internal anchor
DirectTryOn achieves state-of-the-art one-step virtual try-on performance by applying pure conditional transport, garment preservation loss, and self-consistency loss to straighten trajectories in pretrained generative models.
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation cs.CV · 2026-05-12 · unverdicted · none · ref 48 · internal anchor
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm cs.CV · 2026-05-12 · unverdicted · none · ref 28 · 2 links · internal anchor
Proposes V2V-Zero, a training-free framework replacing text conditioning with VLM final-layer hidden states from visual pages, achieving 0.85 on GenEval and 32.7/100 on new Simple-V2V Bench across models including video extension.
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics cs.CV · 2026-05-12 · unverdicted · none · ref 42 · 2 links · internal anchor
MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.
Detecting Deception, Not Deepfakes: Why Media Forensics Needs Social Theories cs.CY · 2026-05-09 · unverdicted · none · ref 76 · internal anchor
Deepfake detection must shift from classifying media realism to detecting communicative deception by applying Speech Act Theory, Grice's Cooperative Principle, and Cialdini's influence principles.
Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision cs.CV · 2026-05-08 · unverdicted · none · ref 55 · internal anchor
Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.
Adaptive Subspace Projection for Generative Personalization cs.CV · 2026-05-08 · unverdicted · none · ref 41 · internal anchor
A training-free adaptive subspace projection method mitigates semantic collapsing in generative personalization by isolating and adjusting drift in a low-dimensional subspace using the stable pre-trained embedding as anchor.
A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping cs.CV · 2026-05-06 · unverdicted · none · ref 31 · internal anchor
Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and learning-based methods including a proposed diffusion-based V-cache.
CA-IDD: Cross-Attention Guided Identity-Conditional Diffusion for Identity-Consistent Face Swapping cs.CV · 2026-04-27 · unverdicted · none · ref 28 · internal anchor
CA-IDD is the first diffusion model for face swapping that integrates multi-modal cross-attention guidance from identity embeddings, gaze, and facial parsing to achieve better identity consistency and an FID of 11.73 over GAN baselines.
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation cs.CV · 2026-04-26 · unverdicted · none · ref 50 · 2 links · internal anchor
MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition cs.GR · 2026-04-23 · unverdicted · none · ref 23 · internal anchor
StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.
AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe cs.MM · 2026-04-22 · unverdicted · none · ref 76 · internal anchor
AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both an XAI probe and creative tool.
View-Consistent 3D Scene Editing via Dual-Path Structural Correspondense and Semantic Continuity cs.CV · 2026-04-20 · unverdicted · none · ref 20 · internal anchor
A dual-path consistency framework for text-driven 3D scene editing that models cross-view dependencies via structural correspondence and semantic continuity, trained on a newly constructed paired multi-view dataset.
ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding cs.CV · 2026-04-15 · unverdicted · none · ref 46 · internal anchor
ASTRA disentangles subject identity from pose structure in diffusion transformers via retrieval-augmented pose guidance, asymmetric EURoPE embeddings, and a DSM adapter to improve multi-subject generation.
Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors cs.CV · 2026-04-14 · unverdicted · none · ref 33 · internal anchor
A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.
UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation cs.CV · 2026-04-12 · conditional · none · ref 76 · internal anchor
UDAPose improves low-light human pose estimation by synthesizing realistic images via DHF and LCIM modules and dynamically balancing image cues with pose priors using DCA, yielding AP gains of 10.1 and 7.4 over prior methods.
NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity cs.LG · 2026-04-10 · unverdicted · none · ref 67 · internal anchor
NeuroFlow is the first unified flow model for bidirectional visual encoding and decoding from neural activity using NeuroVAE and cross-modal flow matching.
Large-Scale Universal Defect Generation: Foundation Models and Datasets cs.CV · 2026-04-10 · unverdicted · none · ref 10 · internal anchor
A 300K quadruplet dataset and UniDG foundation model enable reference- or text-driven defect generation across categories, outperforming few-shot baselines on anomaly detection tasks.
MoZoo:Unleashing Video Diffusion power in animal fur and muscle simulation cs.GR · 2026-04-08 · unverdicted · none · ref 49 · internal anchor
MoZoo generates high-fidelity animal videos with fur and muscle dynamics from coarse meshes by extending video diffusion with role-aware RoPE and asymmetric decoupled attention, trained on a new synthetic-to-real dataset.
Graph-PiT: Enhancing Structural Coherence in Part-Based Image Synthesis via Graph Priors cs.CV · 2026-04-07 · unverdicted · none · ref 9 · internal anchor
Graph-PiT adds graph priors and a hierarchical GNN to part-based image synthesis to enforce relational constraints and improve structural coherence over vanilla PiT.
NURBS Splatting: A Unified Differentiable Rendering Framework for Vector Graphics cs.GR · 2026-06-30 · unverdicted · none · ref 47 · internal anchor
NURBS Splatting represents rational splines as continuous Gaussian fields sampled along the curve to enable stable differentiable rendering of vector graphics.
Histogram-constrained Image Generation cs.CV · 2026-06-30 · unverdicted · none · ref 69 · internal anchor
HIG enforces exact histogram constraints on diffusion-generated images by modeling the control task as an optimal transport problem and applying guidance transformations during sampling.
GarmentZoom: Generating Zoomable Images from Garment Listings cs.CV · 2026-06-28 · unverdicted · none · ref 26 · internal anchor
GarmentZoom trains one model to synthesize unaligned close-up details into full-view garment images across continuous scales 3-20x without per-instance tuning.
COMPASS: Grounding Composition-Intent Guidance in Unified Multimodal Models cs.AI · 2026-06-27 · unverdicted · none · ref 37 · internal anchor
COMPASS is a unified multimodal framework using a shared expert token τ_c to ground composition-intent for both perception and generation, backed by the new Comp-11 dataset.
Controllable Texture Tiling with Transformed RoPE-Enhanced Diffusion Models cs.GR · 2026-06-22 · unverdicted · none · ref 3 · internal anchor
A Diffusion Transformer framework applies coordinate-transformed RoPE and disjoint attention masks to achieve controllable, high-fidelity texture tiling that preserves reference structure and scene lighting.
IP-Adapter Is All You Need: Towards Fine-Tuning-Free Diffusion-Based Talking Face Generation cs.CV · 2026-05-28 · unverdicted · none · ref 43 · internal anchor
A fine-tuning-free framework combines pretrained Stable Diffusion with IP-Adapter plus three parameter-free modules to achieve improved lip synchronization and visual quality in talking face generation.
Scheduled Style Injection: Expanding the Style-Content Pareto Frontier in Training-Free Diffusion-based Style Transfer cs.CV · 2026-05-26 · conditional · none · ref 41 · internal anchor
Scheduled decreasing style injection across decoder layers and denoising timesteps, combined with ControlNet scheduling, expands the style-content tradeoff frontier and achieves 6.1% better ArtFID than StyleID across 28k images.
CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation cs.CV · 2026-05-25 · unverdicted · none · ref 50 · internal anchor
A multi-teacher distillation framework that packs 50 effect LoRAs and fast sampling into a single adapter while aiming to avoid concept interference.
Injecting Image Guidance into Text-Conditioned Diffusion Models at Inference cs.CV · 2026-05-24 · unverdicted · none · ref 28 · internal anchor
VCF aligns CLIP image features to text embeddings via a lightweight aligner to enable dual image-text conditioning in Stable Diffusion at inference without concept-specific training.
Reasoning to Align: Implicit Reasoning in Diffusion Transformers for Video Editing cs.CV · 2026-05-23 · unverdicted · none · ref 53 · internal anchor
RVEDiT improves DiT-based video editing by granularity-routed token conditioning and reference-anchored attention alignment to achieve better temporal coherence and localized edits.
AttriStory: Fine-grained Attribute Realization for Visual Storytelling with Diffusion Models cs.CV · 2026-05-20 · unverdicted · none · ref 39 · internal anchor
AttriStory adds a benchmark and AttriLoss-based latent optimization to improve faithful rendering of fine-grained attributes such as clothing color and texture in diffusion-model visual storytelling.
VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers cs.CV · 2026-05-17 · unverdicted · none · ref 56 · internal anchor
VISTA introduces a new synthetic triplet dataset and diffusion-transformer framework with style adapter that jointly models style, content, and motion to achieve state-of-the-art video style transfer.
DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers cs.CV · 2026-05-16 · unverdicted · none · ref 78 · internal anchor
DiRotQ uses PCA-based rotation-aware activation quantization combined with GPTQ to achieve better FID and PSNR in 4-bit diffusion transformers than prior methods like SVDQuant.
AdaEraser: Training-Free Object Removal via Adaptive Attention Suppression cs.CV · 2026-05-15 · unverdicted · none · ref 47 · internal anchor
AdaEraser introduces token-wise adaptive attention suppression in diffusion denoising to enable high-quality training-free object removal by modulating suppression according to evolving self-attention maps.
MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer cs.CV · 2026-05-15 · unverdicted · none · ref 43 · internal anchor
MaTe proposes a training-free diffusion transformer that performs material transfer using only images by integrating them at the token level for unified multi-modal attention in a shared latent space.
CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis cs.CV · 2026-05-12 · unverdicted · none · ref 18 · internal anchor
CRAFT adapts diffusion models to medical images via clinical reward alignment from LLMs and VLMs, improving alignment scores and cutting low-quality generations by 20.4% on average across modalities.
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation cs.CV · 2026-05-12 · unverdicted · none · ref 42 · 2 links · internal anchor
A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.
L2P: Unlocking Latent Potential for Pixel Generation cs.CV · 2026-05-12 · unverdicted · none · ref 25 · internal anchor
L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
Follow the Mean: Reference-Guided Flow Matching cs.LG · 2026-05-11 · unverdicted · none · ref 37 · 3 links · internal anchor
Flow matching velocity fields are governed solely by conditional endpoint means, so changing the reference-set mean steers generation without parameter updates.
Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition cs.CV · 2026-05-11 · unverdicted · none · ref 54 · 2 links · internal anchor
Fashion130K dataset and UMC framework align text and visual prompts to generate more consistent fashion outfits than prior state-of-the-art methods.
Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models cs.CV · 2026-05-08 · unverdicted · none · ref 15 · internal anchor
HFRU is a two-stage reinforcement unlearning method operating on the vision encoder with GRPO optimization and an abstraction reward that achieves over 98% forgetting and retention on object and face tasks with negligible hallucination.

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer