Canonical reference

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer · 2022

49 Pith papers cite this work; 73% of classified citations cite it as background.


citation-role summary

background 9 · method 2

citation-polarity summary

background 8 · use method 2 · unclear 1

representative citing papers

Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models

cs.CR · 2026-05-19 · conditional · novelty 7.0

ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.

StreamingEffect: Real-Time Human-Centric Video Effect Generation

cs.CV · 2026-05-16 · unverdicted · novelty 7.0

StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.

UniTriGen: Unified Triplet Generation of Aligned Visible-Infrared-Label for Few-Shot RGB-T Semantic Segmentation

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

UniTriGen uses unified diffusion in a shared latent space plus lightweight adapters and scene-balanced sampling to produce high-quality aligned VIS-IR-Label triplets from limited paired data, improving few-shot RGB-T semantic segmentation.

Amortized Guidance for Image Inpainting with Pretrained Diffusion Models

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

AID amortizes guidance for diffusion inpainting by training a reusable module via an auxiliary Gaussian formulation and a continuous-time actor-critic algorithm, improving the quality-speed trade-off with under 1% overhead.

Constraint-Aware Flow Matching: Decision Aligned End-to-End Training for Constrained Sampling

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Constraint-Aware Flow Matching integrates constraint projections into the flow matching training objective to align model dynamics with constrained sampling and reduce distributional shift.

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

ConFixGS: Learning to Fix Feedforward 3D Gaussian Splatting with Confidence-Aware Diffusion Priors in Driving Scenes

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

ConFixGS repairs feedforward 3D Gaussian Splatting with confidence-aware diffusion priors, delivering up to 3.68 dB PSNR gains and halved FID scores on Waymo, nuScenes, and KITTI novel view synthesis tasks.

DCR: Counterfactual Attractor Guidance for Rare Compositional Generation

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.

LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

Step-level Denoising-time Diffusion Alignment with Multiple Objectives

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

MSDDA derives a closed-form optimal reverse denoising distribution for multi-objective diffusion alignment that is exactly equivalent to step-level RL fine-tuning with no approximation error.

DMax: Aggressive Parallel Decoding for dLLMs

cs.LG · 2026-04-09 · conditional · novelty 7.0 · 2 refs

DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

cs.CV · 2025-03-10 · unverdicted · novelty 7.0

Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.

SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models

cs.CV · 2026-05-22 · unverdicted · novelty 6.0

SCOPE adds per-pixel action conditioning to pretrained video diffusion models and releases the CrossFPS multi-game dataset to support cross-game FPS world model simulation with zero-shot transfer.

Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

Spatial Gram Alignment aligns internal self-similarities of LDM features with foundation priors to reconcile global structure and fine details in ultra-high-resolution text-to-image synthesis.

Rethinking Cross-Layer Information Routing in Diffusion Transformers

cs.CV · 2026-05-20 · conditional · novelty 6.0

DAR replaces residual addition in DiTs with learnable timestep-adaptive non-incremental aggregation of sublayer outputs, improving FID by 2.11 on ImageNet 256x256 and accelerating convergence by 8.75x.

Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

AVIS applies autoregressive diffusion models to video inverse problems by streaming restoration with measurement-consistent initialization, reducing latency from 114s to 4s and raising throughput to 1.18 FPS (or 5.91 FPS in the Flash variant).

Are Watermarked Images Editable? SafeMark for Watermark-Preserving Text-Guided Image Editing

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

SafeMark integrates a thresholded watermark-decoding loss into diffusion editors to enable text-guided edits that preserve embedded watermarks with high bit accuracy.

LatentBox: Storing AI-Generated Images at Scale via a Latent-First Design

cs.DC · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

LatentBox is a latent-first storage system that cuts persistent storage for AI images by 78.7% while keeping mean and tail latency competitive with traditional pixel storage.

PIXLRelight: Controllable Relighting via Intrinsic Conditioning

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

PIXLRelight is a transformer-based neural renderer that transfers arbitrary PBR lighting to single images via shared intrinsic conditioning extracted from both multi-illumination photos and path-traced coarse 3D renders.

DAD4TS: Data-Augmentation-Oriented Diffusion Model for Time-Series Forecasting with Small-Scale Data

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

DAD4TS augments small time-series datasets with a diffusion model trained via mathematical geometric projections and guided by reinforcement learning to improve forecasting accuracy.

Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

sREPA enforces structural consistency in relational geometry of pre-trained vision features to accelerate DiT training and improve generation quality.

Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

Flash-GRPO introduces iso-temporal grouping and temporal gradient rectification to enable single-step GRPO training that outperforms full-trajectory methods on video diffusion alignment under low compute budgets.

Reduce the Artifacts Bias for More Generalizable AI-Generated Image Detection

cs.CV · 2026-05-14 · conditional · novelty 6.0

SEF introduces GAN upsampling for diverse artifacts and expert fusion to reduce domain interference, yielding stronger generalization on 13 benchmarks for AI-generated image detection.

Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.

citing papers explorer

Showing 49 of 49 citing papers.

Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models · cs.CR · 2026-05-19 · conditional · none · ref 54
ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.
StreamingEffect: Real-Time Human-Centric Video Effect Generation · cs.CV · 2026-05-16 · unverdicted · none · ref 53
StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
UniTriGen: Unified Triplet Generation of Aligned Visible-Infrared-Label for Few-Shot RGB-T Semantic Segmentation · cs.CV · 2026-05-14 · unverdicted · none · ref 7
UniTriGen uses unified diffusion in a shared latent space plus lightweight adapters and scene-balanced sampling to produce high-quality aligned VIS-IR-Label triplets from limited paired data, improving few-shot RGB-T semantic segmentation.
Amortized Guidance for Image Inpainting with Pretrained Diffusion Models · cs.CV · 2026-05-13 · unverdicted · none · ref 30
AID amortizes guidance for diffusion inpainting by training a reusable module via an auxiliary Gaussian formulation and a continuous-time actor-critic algorithm, improving the quality-speed trade-off with under 1% overhead.
Constraint-Aware Flow Matching: Decision Aligned End-to-End Training for Constrained Sampling · cs.LG · 2026-05-12 · unverdicted · none · ref 1
Constraint-Aware Flow Matching integrates constraint projections into the flow matching training objective to align model dynamics with constrained sampling and reduce distributional shift.
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation · cs.CV · 2026-05-12 · unverdicted · none · ref 31
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
ConFixGS: Learning to Fix Feedforward 3D Gaussian Splatting with Confidence-Aware Diffusion Priors in Driving Scenes · cs.CV · 2026-05-10 · unverdicted · none · ref 73
ConFixGS repairs feedforward 3D Gaussian Splatting with confidence-aware diffusion priors, delivering up to 3.68 dB PSNR gains and halved FID scores on Waymo, nuScenes, and KITTI novel view synthesis tasks.
DCR: Counterfactual Attractor Guidance for Rare Compositional Generation · cs.CV · 2026-05-07 · unverdicted · none · ref 30
DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories · cs.CV · 2026-04-16 · unverdicted · none · ref 38
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
Step-level Denoising-time Diffusion Alignment with Multiple Objectives · cs.LG · 2026-04-15 · unverdicted · none · ref 24
MSDDA derives a closed-form optimal reverse denoising distribution for multi-objective diffusion alignment that is exactly equivalent to step-level RL fine-tuning with no approximation error.
DMax: Aggressive Parallel Decoding for dLLMs · cs.LG · 2026-04-09 · conditional · none · ref 64 · 2 links
DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation · cs.CV · 2025-03-10 · unverdicted · none · ref 37
Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.
SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models · cs.CV · 2026-05-22 · unverdicted · none · ref 41
SCOPE adds per-pixel action conditioning to pretrained video diffusion models and releases the CrossFPS multi-game dataset to support cross-game FPS world model simulation with zero-shot transfer.
Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis · cs.CV · 2026-05-20 · unverdicted · none · ref 5
Spatial Gram Alignment aligns internal self-similarities of LDM features with foundation priors to reconcile global structure and fine details in ultra-high-resolution text-to-image synthesis.
Rethinking Cross-Layer Information Routing in Diffusion Transformers · cs.CV · 2026-05-20 · conditional · none · ref 42
DAR replaces residual addition in DiTs with learnable timestep-adaptive non-incremental aggregation of sublayer outputs, improving FID by 2.11 on ImageNet 256x256 and accelerating convergence by 8.75x.
Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models · cs.CV · 2026-05-20 · unverdicted · none · ref 4
AVIS applies autoregressive diffusion models to video inverse problems by streaming restoration with measurement-consistent initialization, reducing latency from 114s to 4s and raising throughput to 1.18 FPS (or 5.91 FPS in the Flash variant).
Are Watermarked Images Editable? SafeMark for Watermark-Preserving Text-Guided Image Editing · cs.CV · 2026-05-19 · unverdicted · none · ref 1
SafeMark integrates a thresholded watermark-decoding loss into diffusion editors to enable text-guided edits that preserve embedded watermarks with high bit accuracy.
LatentBox: Storing AI-Generated Images at Scale via a Latent-First Design · cs.DC · 2026-05-19 · unverdicted · none · ref 54 · 2 links
LatentBox is a latent-first storage system that cuts persistent storage for AI images by 78.7% while keeping mean and tail latency competitive with traditional pixel storage.
PIXLRelight: Controllable Relighting via Intrinsic Conditioning · cs.CV · 2026-05-18 · unverdicted · none · ref 37
PIXLRelight is a transformer-based neural renderer that transfers arbitrary PBR lighting to single images via shared intrinsic conditioning extracted from both multi-illumination photos and path-traced coarse 3D renders.
DAD4TS: Data-Augmentation-Oriented Diffusion Model for Time-Series Forecasting with Small-Scale Data · cs.LG · 2026-05-18 · unverdicted · none · ref 33
DAD4TS augments small time-series datasets with a diffusion model trained via mathematical geometric projections and guided by reinforcement learning to improve forecasting accuracy.
Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers · cs.CV · 2026-05-16 · unverdicted · none · ref 31
sREPA enforces structural consistency in relational geometry of pre-trained vision features to accelerate DiT training and improve generation quality.
Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization · cs.CV · 2026-05-15 · unverdicted · none · ref 18
Flash-GRPO introduces iso-temporal grouping and temporal gradient rectification to enable single-step GRPO training that outperforms full-trajectory methods on video diffusion alignment under low compute budgets.
Reduce the Artifacts Bias for More Generalizable AI-Generated Image Detection · cs.CV · 2026-05-14 · conditional · none · ref 46
SEF introduces GAN upsampling for diverse artifacts and expert fusion to reduce domain interference, yielding stronger generalization on 13 benchmarks for AI-generated image detection.
Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation · cs.CV · 2026-05-13 · unverdicted · none · ref 28
Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping · cs.CV · 2026-05-11 · unverdicted · none · ref 11
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.
Filtering Memorization from Parameter-Space in Diffusion Models · cs.CV · 2026-05-11 · unverdicted · none · ref 2
BAF reduces memorization in diffusion LoRAs by filtering spectral channels of the adaptation weights that show weak alignment with the base model's principal subspace.
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion · cs.CV · 2026-05-08 · unverdicted · none · ref 69
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation · cs.CV · 2026-05-07 · unverdicted · none · ref 2
FREPix achieves competitive FID scores on ImageNet by decomposing image generation into separate low- and high-frequency paths within a flow matching framework.
Intermediate Representations are Strong AI-Generated Image Detectors · cs.CV · 2026-05-05 · unverdicted · none · ref 48
Intermediate layer embedding sensitivity to perturbations distinguishes AI-generated images from real ones, yielding higher AUROC on GenImage and Forensics Small benchmarks than prior methods.
End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer · cs.CV · 2026-05-01 · unverdicted · none · ref 28
An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.
Leveraging Verifier-Based Reinforcement Learning in Image Editing · cs.CV · 2026-04-30 · unverdicted · none · ref 44 · 2 links
Edit-R1 builds a CoT-based reasoning reward model (RRM) via SFT and GCPO, then applies it with GRPO to improve image editing models such as FLUX.1-kontext.
GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads · cs.DC · 2026-04-06 · unverdicted · none · ref 30
GENSERVE improves SLO attainment by up to 44% for co-serving heterogeneous T2I and T2V diffusion workloads via step-level preemption, elastic parallelism, and joint scheduling.
Energy-Guided Generative Modeling for Low-Energy Molecular Structure Discovery · cs.LG · 2025-12-27 · unverdicted · none · ref 74
EnFlow integrates flow-based conformer generation with energy landscape modeling to enable joint ensemble generation and ground-state identification using only 1-2 ODE steps.
Kling-Omni Technical Report · cs.CV · 2025-12-18 · unverdicted · none · ref 23
Kling-Omni is a unified multimodal generative system that produces cinematic videos from diverse inputs by integrating generation, editing, and intelligent reasoning in a single end-to-end model.
DanceGRPO: Unleashing GRPO on Visual Generation · cs.CV · 2025-05-12 · unverdicted · none · ref 2
DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation · cs.CV · 2025-05-08 · unverdicted · none · ref 69
Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.
VideoPhy: Evaluating Physical Commonsense for Video Generation · cs.CV · 2024-06-05 · conditional · none · ref 86
The VideoPhy benchmark shows that even the best state-of-the-art text-to-video model follows both the text prompt and physical commonsense in only 39.6% of cases.
One-Step Distillation of Discrete Diffusion Image Generators via Fixed-Point Iteration · cs.CV · 2026-05-20 · unverdicted · none · ref 39
Fixed-Point Distillation constructs one-step correction targets for discrete diffusion generators via partial corruption and single teacher refinement, lifted into continuous features with a multi-bandwidth drift loss and straight-through estimation.
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset · cs.CV · 2026-05-20 · unverdicted · none · ref 77
MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.
Edit-GRPO: A Locality-Preserving Policy Optimization Framework for Image Editing · cs.CV · 2026-05-16 · unverdicted · none · ref 27
Edit-GRPO decouples editing and preservation objectives via region-specific signals in a policy optimization framework to improve locality in image editing tasks.
InsHuman: Towards Natural and Identity-Preserving Human Insertion · cs.CV · 2026-05-08 · unverdicted · none · ref 1
InsHuman proposes Human-Background Adaptive Fusion, Face-to-Face ID-Preserving, and Bidirectional Data Pairing to enable natural human insertion in images without altering identity.
Video Generation with Predictive Latents · cs.CV · 2026-05-04 · unverdicted · none · ref 33
PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.
Scaling Properties of Continuous Diffusion Spoken Language Models · cs.CL · 2026-04-27 · unverdicted · none · ref 16
Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.
GaiaFlow: Semantic-Guided Diffusion Tuning for Carbon-Frugal Search · cs.IR · 2026-02-17 · unverdicted · none · ref 37
GaiaFlow combines semantic-guided diffusion tuning with early-exit and quantization methods to lower carbon emissions in neural information retrieval while maintaining competitive effectiveness.
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective · cs.RO · 2025-07-02 · unverdicted · none · ref 103
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
Preserve and Personalize: Personalized Text-to-Image Diffusion Models without Distributional Drift · cs.CV · 2025-05-26 · unverdicted · none · ref 8
The paper proposes Lipschitz regularization during fine-tuning to prevent distributional drift in personalized diffusion models, improving subject fidelity and prompt adherence.
JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation · cs.GR · 2026-05-05 · unverdicted · none · ref 68 · 2 links
JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.
From Redaction to Restoration: Deep Learning for Medical Image Anonymization and Reconstruction · cs.CV · 2026-04-13 · unverdicted · none · ref 36
An end-to-end framework redacts PHI from medical images via CRNN detection and restores them with Stable Diffusion inpainting to enable privacy-preserving data sharing without losing downstream utility.
STAMBRIDGE: Spectral-Temporal Amplitude-aware Mid-Feature Bridge for EEG Visual Decoding · eess.IV · 2026-05-22 · unreviewed · ref 25
