MetaEarth-MM unifies multi-modal remote sensing image generation and any-to-any translation across five modalities via scene-centered joint modeling on the new EarthMM dataset.
Unicontrol: A unified diffusion model for controllable visual generation in the wild
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
roles: method 1
citation-polarity summary
polarities: use method 1
citing papers explorer
- InstanceControl: Controllable Complex Image Generation without Instance Labeling
  InstanceControl uses VLMs to auto-generate instance masks from text and visual conditions, with adaptive refinement, to enable controllable multi-object image generation without manual labeling.
- OmniGen-AR: AutoRegressive Any-to-Image Generation
  OmniGen-AR is a unified autoregressive framework for any-to-image generation that tokenizes text and visual conditions together and uses disentangled causal attention to support tasks like text-to-image, depth-to-image, image editing, and text-to-video while reporting 0.63 on GenEval and 80.02 on VB
- UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
  UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.
- SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness
  SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.
- PiERN: Token-Level Routing for Integrating High-Precision Computation and Reasoning
  PiERN proposes token-level routing of physically-isolated experts to embed high-precision computation directly into LLMs, reporting higher accuracy and lower latency, token count, and energy use than fine-tuning or multi-agent baselines.
- Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering
  Ouroboros uses two single-step diffusion models with cycle consistency for forward and inverse rendering, extending intrinsic decomposition to indoor/outdoor scenes with faster inference than multi-step methods.
- ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL
  Introduces the ProductConsistency dataset, benchmark, and Cyclic Consistency reward to fine-tune image editing models, achieving a 5x reduction in character error rate for product identity preservation.
- IdentiFace: Multi-Modal Iterative Diffusion Framework for Identifiable Suspect Face Generation in Crime Investigations
  IdentiFace is a multi-modal iterative diffusion framework that generates identifiable suspect faces with improved identity retrieval for law enforcement applications.
- UNITY: Attention Flow Networks for Adaptive Conditioning in Diffusion
  UNITY is a two-stage adapter with Morphable Attention Flow networks for efficient single and composite conditioning in diffusion-based image generation.