Ms-diffusion: Multi-subject zero-shot image personalization with layout guidance. arXiv preprint arXiv:2406.07209, 2024
11 Pith papers cite this work.
representative citing papers
ASTRA disentangles subject identity from pose structure in diffusion transformers via retrieval-augmented pose guidance, asymmetric EURoPE embeddings, and a DSM adapter to improve multi-subject generation.
LooseRoPE modulates RoPE in diffusion attention maps to continuously trade off between preserving a pasted object's identity and harmonizing it with its new surroundings.
T2I-FactualBench is a new three-tier benchmark for factuality of knowledge-intensive concepts in T2I models, using multi-round VQA evaluation to show SOTA models need improvement.
Presents MRT, a 20B-parameter masked region diffusion model unifying text-to-layers, image-to-layers, and layers-to-layers tasks with an overflow-aware canvas layer for complete editable outputs.
A zero-shot subject-driven video generation framework that decomposes the task into identity injection from 200K subject-image pairs and motion preservation from 4K arbitrary videos, trained in 288 A100 GPU hours on CogVideoX-5B to match prior performance at 1% compute.
FreeGraftor performs subject-driven text-to-image generation without training by cross-image feature grafting via semantic matching, position-constrained attention fusion, and a noise initialization strategy that preserves reference geometry.
A data-generation pipeline plus pairwise subject-consistency rewards in RL improve consistency and prompt adherence for multi-subject personalized image generation.
AnimalBooth introduces an Animal Net, adaptive attention module, and frequency-controlled DCT feature integration to improve identity preservation and perceptual quality in personalized animal image generation, supported by a new high-resolution dataset AnimalBench.
UniVerse proposes a unified modulation framework for segmentation-free, disentangled multi-concept personalization in diffusion transformers, claiming superior localization and fidelity over baselines.
citing papers explorer
- EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation
  EM-Vid introduces an entity-centric latent patch memory bank with sparse token conditioning and budgeted updates for training-free consistent multi-shot video generation.
- ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding
  ASTRA disentangles subject identity from pose structure in diffusion transformers via retrieval-augmented pose guidance, asymmetric EURoPE embeddings, and a DSM adapter to improve multi-subject generation.
- LooseRoPE: Content-aware Attention Manipulation for Semantic Harmonization
  LooseRoPE modulates RoPE in diffusion attention maps to continuously trade off between preserving a pasted object's identity and harmonizing it with its new surroundings.
- T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts
  T2I-FactualBench is a new three-tier benchmark for factuality of knowledge-intensive concepts in T2I models, using multi-round VQA evaluation to show SOTA models need improvement.
- MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale
  Presents MRT, a 20B-parameter masked region diffusion model unifying text-to-layers, image-to-layers, and layers-to-layers tasks with an overflow-aware canvas layer for complete editable outputs.
- Learning Zero-Shot Subject-Driven Video Generation Using 1% Compute
  A zero-shot subject-driven video generation framework that decomposes the task into identity injection from 200K subject-image pairs and motion preservation from 4K arbitrary videos, trained in 288 A100 GPU hours on CogVideoX-5B to match prior performance at 1% compute.
- FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation
  FreeGraftor performs subject-driven text-to-image generation without training by cross-image feature grafting via semantic matching, position-constrained attention fusion, and a noise initialization strategy that preserves reference geometry.
- Animalbooth: multimodal feature enhancement for animal subject personalization
  AnimalBooth introduces an Animal Net, adaptive attention module, and frequency-controlled DCT feature integration to improve identity preservation and perceptual quality in personalized animal image generation, supported by a new high-resolution dataset AnimalBench.
- UniVerse: A Unified Modulation Framework for Segmentation-Free, Disentangled Multi-Concept Personalization
  UniVerse proposes a unified modulation framework for segmentation-free, disentangled multi-concept personalization in diffusion transformers, claiming superior localization and fidelity over baselines.