Ms-diffusion: Multi-subject zero-shot im- age personalization with layout guidance.arXiv preprint arXiv:2406.07209, 2024

Xierui Wang, Siming Fu, Qihan Huang, Wanggui He, Hao Jiang · 2024 · arXiv 2406.07209

11 Pith papers cite this work. Polarity classification is still indexing.

11 Pith papers citing it

representative citing papers

EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

EM-Vid introduces an entity-centric latent patch memory bank with sparse token conditioning and budgeted updates for training-free consistent multi-shot video generation.

ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

ASTRA disentangles subject identity from pose structure in diffusion transformers via retrieval-augmented pose guidance, asymmetric EURoPE embeddings, and a DSM adapter to improve multi-subject generation.

LooseRoPE: Content-aware Attention Manipulation for Semantic Harmonization

cs.GR · 2026-01-08 · unverdicted · novelty 7.0

LooseRoPE modulates RoPE in diffusion attention maps to continuously trade off between preserving a pasted object's identity and harmonizing it with its new surroundings.

T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts

cs.CV · 2024-12-05 · unverdicted · novelty 7.0

T2I-FactualBench is a new three-tier benchmark for factuality of knowledge-intensive concepts in T2I models, using multi-round VQA evaluation to show SOTA models need improvement.

MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale

cs.CV · 2026-05-26 · unverdicted · novelty 6.0

Presents MRT, a 20B-parameter masked region diffusion model unifying text-to-layers, image-to-layers, and layers-to-layers tasks with an overflow-aware canvas layer for complete editable outputs.

Learning Zero-Shot Subject-Driven Video Generation Using 1% Compute

cs.CV · 2025-04-23 · unverdicted · novelty 6.0

A zero-shot subject-driven video generation framework that decomposes the task into identity injection from 200K subject-image pairs and motion preservation from 4K arbitrary videos, trained in 288 A100 GPU hours on CogVideoX-5B to match prior performance at 1% compute.

FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation

cs.CV · 2025-04-22 · unverdicted · novelty 6.0

FreeGraftor performs subject-driven text-to-image generation without training by cross-image feature grafting via semantic matching, position-constrained attention fusion, and a noise initialization strategy that preserves reference geometry.

PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards

cs.CV · 2025-12-01 · conditional · novelty 5.0

A data-generation pipeline plus pairwise subject-consistency rewards in RL improve consistency and prompt adherence for multi-subject personalized image generation.

Animalbooth: multimodal feature enhancement for animal subject personalization

cs.CV · 2025-09-20 · unverdicted · novelty 5.0

AnimalBooth introduces an Animal Net, adaptive attention module, and frequency-controlled DCT feature integration to improve identity preservation and perceptual quality in personalized animal image generation, supported by a new high-resolution dataset AnimalBench.

UniVerse: A Unified Modulation Framework for Segmentation-Free,Disentangled Multi-Concept Personalization

cs.CV · 2026-05-29 · unverdicted · novelty 4.0

UniVerse proposes a unified modulation framework for segmentation-free, disentangled multi-concept personalization in diffusion transformers, claiming superior localization and fidelity over baselines.

DSH-Bench: A Difficulty- and Scenario-Aware Benchmark with Hierarchical Subject Taxonomy for Subject-Driven Text-to-Image Generation

cs.CV · 2026-03-09

citing papers explorer

Showing 9 of 9 citing papers after filters.

EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation cs.CV · 2026-05-22 · unverdicted · none · ref 17
EM-Vid introduces an entity-centric latent patch memory bank with sparse token conditioning and budgeted updates for training-free consistent multi-shot video generation.
ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding cs.CV · 2026-04-15 · unverdicted · none · ref 40
ASTRA disentangles subject identity from pose structure in diffusion transformers via retrieval-augmented pose guidance, asymmetric EURoPE embeddings, and a DSM adapter to improve multi-subject generation.
LooseRoPE: Content-aware Attention Manipulation for Semantic Harmonization cs.GR · 2026-01-08 · unverdicted · none · ref 42
LooseRoPE modulates RoPE in diffusion attention maps to continuously trade off between preserving a pasted object's identity and harmonizing it with its new surroundings.
T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts cs.CV · 2024-12-05 · unverdicted · none · ref 52
T2I-FactualBench is a new three-tier benchmark for factuality of knowledge-intensive concepts in T2I models, using multi-round VQA evaluation to show SOTA models need improvement.
MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale cs.CV · 2026-05-26 · unverdicted · none · ref 52
Presents MRT, a 20B-parameter masked region diffusion model unifying text-to-layers, image-to-layers, and layers-to-layers tasks with an overflow-aware canvas layer for complete editable outputs.
Learning Zero-Shot Subject-Driven Video Generation Using 1% Compute cs.CV · 2025-04-23 · unverdicted · none · ref 48
A zero-shot subject-driven video generation framework that decomposes the task into identity injection from 200K subject-image pairs and motion preservation from 4K arbitrary videos, trained in 288 A100 GPU hours on CogVideoX-5B to match prior performance at 1% compute.
FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation cs.CV · 2025-04-22 · unverdicted · none · ref 24
FreeGraftor performs subject-driven text-to-image generation without training by cross-image feature grafting via semantic matching, position-constrained attention fusion, and a noise initialization strategy that preserves reference geometry.
Animalbooth: multimodal feature enhancement for animal subject personalization cs.CV · 2025-09-20 · unverdicted · none · ref 17
AnimalBooth introduces an Animal Net, adaptive attention module, and frequency-controlled DCT feature integration to improve identity preservation and perceptual quality in personalized animal image generation, supported by a new high-resolution dataset AnimalBench.
UniVerse: A Unified Modulation Framework for Segmentation-Free,Disentangled Multi-Concept Personalization cs.CV · 2026-05-29 · unverdicted · none · ref 31
UniVerse proposes a unified modulation framework for segmentation-free, disentangled multi-concept personalization in diffusion transformers, claiming superior localization and fidelity over baselines.

Ms-diffusion: Multi-subject zero-shot im- age personalization with layout guidance.arXiv preprint arXiv:2406.07209, 2024

fields

years

verdicts

representative citing papers

citing papers explorer