InstantID: Zero-shot Identity-Preserving Generation in Seconds

Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li · 2024 · cs.CV · arXiv 2401.07519

Canonical reference. 29 Pith papers cite this work; 75% of the classified citations use it as background.

abstract

There has been significant progress in personalized image synthesis with methods such as Textual Inversion, DreamBooth, and LoRA. Yet, their real-world applicability is hindered by high storage demands, lengthy fine-tuning processes, and the need for multiple reference images. Conversely, existing ID embedding-based methods, while requiring only a single forward inference, face challenges: they either necessitate extensive fine-tuning across numerous model parameters, lack compatibility with community pre-trained models, or fail to maintain high face fidelity. Addressing these limitations, we introduce InstantID, a powerful diffusion model-based solution. Our plug-and-play module adeptly handles image personalization in various styles using just a single facial image, while ensuring high fidelity. To achieve this, we design a novel IdentityNet by imposing strong semantic and weak spatial conditions, integrating facial and landmark images with textual prompts to steer the image generation. InstantID demonstrates exceptional performance and efficiency, proving highly beneficial in real-world applications where identity preservation is paramount. Moreover, our work seamlessly integrates with popular pre-trained text-to-image diffusion models like SD1.5 and SDXL, serving as an adaptable plugin. Our codes and pre-trained checkpoints will be available at https://github.com/InstantID/InstantID.

citation-role summary

background: 6 · dataset: 1 · other: 1
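
As a quick check on the figures above, the 75% background share follows from the role counts if it is taken over the 8 classified citations rather than over all 29 citing papers. A minimal sketch, assuming that reading; the variable names are illustrative and not part of any Pith tooling:

```python
# Counts copied from the citation-role summary; names are illustrative only.
role_counts = {"background": 6, "dataset": 1, "other": 1}

# Only classified citations enter the denominator (assumption: 8 of the 29).
classified = sum(role_counts.values())
background_share = role_counts["background"] / classified

print(f"{classified} classified citations, background share = {background_share:.0%}")
```

Under this reading, the remaining 21 citing papers simply have no role assigned yet.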

citation-polarity summary

background: 6 · unclear: 1 · use dataset: 1

representative citing papers

Splatshot: 3D Face Avatar Generation from a Single Unconstrained Photo

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

SplatShot is a training-free method that inserts per-step 3DGS refitting and photometric feedback into diffusion denoising to enforce multi-view consistency for single-photo 3D face avatars.

Non-Colliding Biometric Identities for Digital Entities: Geometry, Capacity, and Million-Scale Virtual Identity Provisioning

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

Introduces the BIP framework and the GapGen generator to allocate and synthesize millions of non-colliding virtual face identities within gaps of the real face manifold.

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

Adaptive Subspace Projection for Generative Personalization

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

A training-free adaptive subspace projection method mitigates semantic collapsing in generative personalization by isolating and adjusting drift in a low-dimensional subspace using the stable pre-trained embedding as anchor.

StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition

cs.GR · 2026-04-23 · unverdicted · novelty 7.0

StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.

HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation

cs.CV · 2026-04-07 · unverdicted · novelty 7.0

HumANDiff improves motion consistency in human video generation by sampling diffusion noise on an articulated human body template and adding joint appearance-motion prediction plus a geometric consistency loss.

Setting the Stage: Text-Driven Scene-Consistent Image Generation

cs.CV · 2025-12-14 · conditional · novelty 7.0

A new data pipeline using real photos, entity removal, and image-to-video models plus a cross-view attention loss enables text-driven generation of actors in reference scenes with improved alignment.

Durian: Dual Reference Image-Guided Portrait Animation with Attribute Transfer

cs.CV · 2025-09-04 · conditional · novelty 7.0

Durian introduces a dual-reference diffusion model trained via self-reconstruction on video frames to enable cross-identity attribute transfer in portrait animations, supporting multi-attribute composition and interpolation.

VACE: All-in-One Video Creation and Editing

cs.CV · 2025-03-10 · unverdicted · novelty 7.0

VACE unifies reference-to-video generation, video-to-video editing, and masked video-to-video editing in one Diffusion Transformer framework using a Video Condition Unit for inputs and a Context Adapter for task injection.

AttriStory: Fine-grained Attribute Realization for Visual Storytelling with Diffusion Models

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

AttriStory adds a benchmark and AttriLoss-based latent optimization to improve faithful rendering of fine-grained attributes such as clothing color and texture in diffusion-model visual storytelling.

VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

VISTA introduces a new synthetic triplet dataset and diffusion-transformer framework with style adapter that jointly models style, content, and motion to achieve state-of-the-art video style transfer.

InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

cs.CV · 2026-05-14 · conditional · novelty 6.0

InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.

L2P: Unlocking Latent Potential for Pixel Generation

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

Decoupling Semantics and Fingerprints: A Universal Representation for AI-Generated Image Detection

cs.CV · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

ODP-Net uses instance-aware orthogonal decomposition, perturbation-based purification, and manifold alignment to separate universal forgery traces, generator fingerprints, and semantics, achieving SOTA on unseen architectures like Stable Diffusion 3.

DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior

cs.CV · 2026-04-19 · unverdicted · novelty 6.0

DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.

PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional loss plus geometric priors to preserve correct component relationships.

IdGlow: Dynamic Identity Modulation for Multi-Subject Generation

cs.CV · 2026-02-28 · unverdicted · novelty 6.0

IdGlow is a progressive two-stage diffusion framework that uses task-adaptive timestep scheduling, temporal gating, VLM prompt synthesis, and group-level DPO to balance identity preservation and scene coherence in multi-subject image generation.

How Noise Benefits AI-generated Image Detection

cs.CV · 2025-11-20 · unverdicted · novelty 6.0

PiN-CLIP jointly trains a noise generator and detector under a variational positive-incentive principle to inject feature-space noise that suppresses shortcut directions and improves out-of-distribution accuracy by 5.4 points on images from 42 generative models.

Adversarial Concept Distillation for One-Step Diffusion Personalization

cs.CV · 2025-10-23 · unverdicted · novelty 6.0

OPAD enables reliable high-quality personalization of one-step diffusion models via multi-step teacher distillation combined with adversarial alignment losses.

NullFace: Training-Free Localized Face Anonymization

cs.CV · 2025-03-11 · unverdicted · novelty 6.0

NullFace performs training-free localized face anonymization by inverting images to noise and denoising with modified identity embeddings from a pre-trained diffusion model.

PGC: Peak-Guided Calibration for Generalizable AI-Generated Image Detection

cs.CV · 2026-05-20 · unverdicted · novelty 5.0

PGC introduces peak-focusing aggregation of local discriminative clues to calibrate global representations for AI-generated image detection, reporting accuracy gains on a new 15-model commercial benchmark and standard datasets.

Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction

cs.CV · 2026-05-20 · unverdicted · novelty 5.0

A two-stage method predicts an intermediate Canny map for structure then renders the image conditioned on appearance and structure, paired with a 100k text-aware dataset, to improve detail preservation in subject-driven generation.

DreamEdit3D: Personalization of Multi-View Diffusion Models for 3D Editing

cs.CV · 2026-05-16 · unverdicted · novelty 5.0

DreamEdit3D learns separate token embeddings for segmented object components via two-phase multi-view optimization to enable text-guided 3D editing with consistent image generation and mesh reconstruction.

RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation

cs.CV · 2026-05-12 · unverdicted · novelty 5.0

RealDiffusion uses heat diffusion as a dissipative prior and a region-aware stochastic process inside a training-free physics-informed attention mechanism to improve multi-character coherence while preserving narrative dynamism in sequential image generation.

citing papers explorer

Showing 29 of 29 citing papers.

Splatshot: 3D Face Avatar Generation from a Single Unconstrained Photo · cs.CV · 2026-05-31 · unverdicted · none · ref 94 · internal anchor
SplatShot is a training-free method that inserts per-step 3DGS refitting and photometric feedback into diffusion denoising to enforce multi-view consistency for single-photo 3D face avatars.
Non-Colliding Biometric Identities for Digital Entities: Geometry, Capacity, and Million-Scale Virtual Identity Provisioning · cs.CV · 2026-05-18 · unverdicted · none · ref 34 · internal anchor
Introduces the BIP framework and the GapGen generator to allocate and synthesize millions of non-colliding virtual face identities within gaps of the real face manifold.
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation · cs.CV · 2026-05-12 · unverdicted · none · ref 36 · internal anchor
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
Adaptive Subspace Projection for Generative Personalization · cs.CV · 2026-05-08 · unverdicted · none · ref 39 · internal anchor
A training-free adaptive subspace projection method mitigates semantic collapsing in generative personalization by isolating and adjusting drift in a low-dimensional subspace using the stable pre-trained embedding as anchor.
StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition · cs.GR · 2026-04-23 · unverdicted · none · ref 21 · internal anchor
StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.
HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation · cs.CV · 2026-04-07 · unverdicted · none · ref 64 · internal anchor
HumANDiff improves motion consistency in human video generation by sampling diffusion noise on an articulated human body template and adding joint appearance-motion prediction plus a geometric consistency loss.
Setting the Stage: Text-Driven Scene-Consistent Image Generation · cs.CV · 2025-12-14 · conditional · none · ref 34 · internal anchor
A new data pipeline using real photos, entity removal, and image-to-video models plus a cross-view attention loss enables text-driven generation of actors in reference scenes with improved alignment.
Durian: Dual Reference Image-Guided Portrait Animation with Attribute Transfer · cs.CV · 2025-09-04 · conditional · none · ref 16 · internal anchor
Durian introduces a dual-reference diffusion model trained via self-reconstruction on video frames to enable cross-identity attribute transfer in portrait animations, supporting multi-attribute composition and interpolation.
VACE: All-in-One Video Creation and Editing · cs.CV · 2025-03-10 · unverdicted · none · ref 67 · internal anchor
VACE unifies reference-to-video generation, video-to-video editing, and masked video-to-video editing in one Diffusion Transformer framework using a Video Condition Unit for inputs and a Context Adapter for task injection.
AttriStory: Fine-grained Attribute Realization for Visual Storytelling with Diffusion Models · cs.CV · 2026-05-20 · unverdicted · none · ref 34 · internal anchor
AttriStory adds a benchmark and AttriLoss-based latent optimization to improve faithful rendering of fine-grained attributes such as clothing color and texture in diffusion-model visual storytelling.
VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers · cs.CV · 2026-05-17 · unverdicted · none · ref 50 · internal anchor
VISTA introduces a new synthetic triplet dataset and diffusion-transformer framework with style adapter that jointly models style, content, and motion to achieve state-of-the-art video style transfer.
InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation · cs.CV · 2026-05-14 · conditional · none · ref 46 · internal anchor
InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.
L2P: Unlocking Latent Potential for Pixel Generation · cs.CV · 2026-05-12 · unverdicted · none · ref 20 · internal anchor
L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
Decoupling Semantics and Fingerprints: A Universal Representation for AI-Generated Image Detection · cs.CV · 2026-05-08 · unverdicted · none · ref 32 · 2 links · internal anchor
ODP-Net uses instance-aware orthogonal decomposition, perturbation-based purification, and manifold alignment to separate universal forgery traces, generator fingerprints, and semantics, achieving SOTA on unseen architectures like Stable Diffusion 3.
DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior · cs.CV · 2026-04-19 · unverdicted · none · ref 44 · internal anchor
DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.
PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios · cs.CV · 2026-04-15 · unverdicted · none · ref 41 · internal anchor
PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional loss plus geometric priors to preserve correct component relationships.
IdGlow: Dynamic Identity Modulation for Multi-Subject Generation · cs.CV · 2026-02-28 · unverdicted · none · ref 31 · internal anchor
IdGlow is a progressive two-stage diffusion framework that uses task-adaptive timestep scheduling, temporal gating, VLM prompt synthesis, and group-level DPO to balance identity preservation and scene coherence in multi-subject image generation.
How Noise Benefits AI-generated Image Detection · cs.CV · 2025-11-20 · unverdicted · none · ref 64 · internal anchor
PiN-CLIP jointly trains a noise generator and detector under a variational positive-incentive principle to inject feature-space noise that suppresses shortcut directions and improves out-of-distribution accuracy by 5.4 points on images from 42 generative models.
Adversarial Concept Distillation for One-Step Diffusion Personalization · cs.CV · 2025-10-23 · unverdicted · none · ref 89 · internal anchor
OPAD enables reliable high-quality personalization of one-step diffusion models via multi-step teacher distillation combined with adversarial alignment losses.
NullFace: Training-Free Localized Face Anonymization · cs.CV · 2025-03-11 · unverdicted · none · ref 74 · internal anchor
NullFace performs training-free localized face anonymization by inverting images to noise and denoising with modified identity embeddings from a pre-trained diffusion model.
PGC: Peak-Guided Calibration for Generalizable AI-Generated Image Detection · cs.CV · 2026-05-20 · unverdicted · none · ref 7 · internal anchor
PGC introduces peak-focusing aggregation of local discriminative clues to calibrate global representations for AI-generated image detection, reporting accuracy gains on a new 15-model commercial benchmark and standard datasets.
Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction · cs.CV · 2026-05-20 · unverdicted · none · ref 25 · internal anchor
A two-stage method predicts an intermediate Canny map for structure then renders the image conditioned on appearance and structure, paired with a 100k text-aware dataset, to improve detail preservation in subject-driven generation.
DreamEdit3D: Personalization of Multi-View Diffusion Models for 3D Editing · cs.CV · 2026-05-16 · unverdicted · none · ref 44 · internal anchor
DreamEdit3D learns separate token embeddings for segmented object components via two-phase multi-view optimization to enable text-guided 3D editing with consistent image generation and mesh reconstruction.
RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation · cs.CV · 2026-05-12 · unverdicted · none · ref 36 · internal anchor
RealDiffusion uses heat diffusion as a dissipative prior and a region-aware stochastic process inside a training-free physics-informed attention mechanism to improve multi-character coherence while preserving narrative dynamism in sequential image generation.
When Few Steps Are Enough: Training-Free Acceleration of Identity-Preserved Generation · cs.CV · 2026-05-10 · unverdicted · none · ref 11 · internal anchor
A frozen identity adapter from FLUX dev works on the distilled schnell model, enabling 5.9x faster generation with better identity preservation in few steps.
PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards · cs.CV · 2025-12-01 · conditional · none · ref 34 · internal anchor
A data-generation pipeline plus pairwise subject-consistency rewards in RL improve consistency and prompt adherence for multi-subject personalized image generation.
AHS: Adaptive Head Synthesis via Synthetic Data Augmentations · cs.CV · 2026-04-17 · unverdicted · none · ref 54 · internal anchor
Adaptive Head Synthesis (AHS) employs head-reenacted synthetic data augmentation to enable robust head swapping on full upper-body images without paired training data.
Follow the Mean: Reference-Guided Flow Matching · cs.LG · 2026-05-11 · unreviewed · ref 39 · 2 links · internal anchor
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling · cs.CV · 2025-12-14 · unreviewed · ref 34 · internal anchor
