
Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, Stefano Ermon · 2020 · cs.LG · arXiv 2010.02502

Mixed citation behavior. Most common role is background (67%).

533 Pith papers citing it

Background 67% of classified citations


abstract

Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples $10 \times$ to $50 \times$ faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.
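To make the sampling speed-up in the abstract concrete, here is a minimal sketch of a deterministic DDIM-style sampler (the zero-noise case) that visits only a short subsequence of the training timesteps. It is an illustration under stated assumptions, not the authors' implementation: the linear beta schedule, the placeholder noise predictor `toy_eps_model`, and every function name are hypothetical and do not come from the paper or this page.

```python
import numpy as np

def make_alpha_bar(num_train_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal level per timestep (assumed linear beta schedule)."""
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    return np.cumprod(1.0 - betas)

def toy_eps_model(x, t):
    """Hypothetical stand-in for a trained noise-prediction network."""
    return np.zeros_like(x)

def ddim_step(x_t, eps, a_t, a_prev):
    """One deterministic update from signal level a_t to a_prev."""
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
    # Re-noise that estimate to the earlier level, reusing the same noise.
    return np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps

def ddim_sample(eps_model, shape, alpha_bar, num_sample_steps=20, seed=0):
    """Sample using only num_sample_steps of the training timesteps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    # Evenly spaced, descending subsequence of the training timesteps.
    timesteps = np.linspace(len(alpha_bar) - 1, 0, num_sample_steps)
    timesteps = timesteps.round().astype(int)
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        # After the last visited timestep, target the noise-free level 1.0.
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        x = ddim_step(x, eps_model(x, t), a_t, a_prev)
    return x

if __name__ == "__main__":
    alpha_bar = make_alpha_bar()
    sample = ddim_sample(toy_eps_model, (4,), alpha_bar, num_sample_steps=20)
    print(sample.shape)
```

In this sketch the number of network evaluations equals `num_sample_steps` rather than the number of training timesteps, which is the source of the compute-versus-quality trade-off the abstract mentions. Because no fresh noise is injected, the output is fully determined by the initial noise, which is what makes interpolation in the latent space meaningful.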


citation-role summary

background 58 method 23 baseline 2

citation-polarity summary

background 56 use method 23 baseline 2 support 1 unclear 1


authors

Jiaming Song, Chenlin Meng, and Stefano Ermon

representative citing papers

Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

cs.CV · 2026-06-09 · conditional · novelty 8.0

Lip Forcing distills a 14B bidirectional video diffusion teacher into autoregressive students that achieve real-time lip synchronization at 31 FPS using two denoising steps without CFG.

Test-time Adversarial Takeover: A Real-time Hijacking Interface against Robotic Diffusion Policies

cs.RO · 2026-06-09 · unverdicted · novelty 8.0

TAKO demonstrates real-time adversarial takeover of robotic diffusion policies via reusable universal patches on visual inputs, achieving 100% success in steering attacker-chosen trajectories across multiple tasks, encoders, and diffusion methods.

ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos

cs.CV · 2026-04-04 · unverdicted · novelty 8.0

ActivityForensics is the first large-scale benchmark for temporally localizing activity-level forgeries in videos, paired with a diffusion-based baseline called TADiff.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Consistency Models

cs.LG · 2023-03-02 · conditional · novelty 8.0

Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

Flow-Map GRPO: Reinforcement Learning for Few-Step Flow-Map Generators via Anchored Stochastic Composition

cs.LG · 2026-07-01 · unverdicted · novelty 7.0

Flow-Map GRPO uses anchored stochastic flow map composition to enable GRPO-based RL alignment of deterministic few-step flow-map generators while preserving their marginal paths.

Cross-Space Distillation: Teaching One-Step Students with Modern Diffusion Teachers

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

Introduces a Bridge latent interface that maps mismatched student latents into teacher space, enabling distillation from modern diffusion teachers to compact one-step students and raising SD 1.5 HPSv3 from 5.4 to 9.4 while keeping one-step speed.

MUSE: Unlocking Timestep as Native Task Steering for One-Step Dense Prediction

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

MUSE shows that the native timestep embedding in diffusion models acts as a parameter-free steering signal for multi-task monocular depth and normal estimation via manifold decoupling in latent space.

ASTAD: Asymmetric Style Transfer for Synthetic-to-Real Adaptation in Autonomous Driving

cs.CV · 2026-06-28 · unverdicted · novelty 7.0

Introduces the ASTAD task and training-free ASTModel framework for semantically consistent asymmetric style transfer using labeled synthetic content and unlabeled real references.

Diffusion Model Attribution via Spectral Coupling of Denoiser Responses

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

SDS extracts stable spectral signatures from diffusion model denoisers via frequency-controlled perturbations, achieving 99.9% attribution accuracy across eight models and 96.2% under prompt shift.

MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training

cs.CV · 2026-06-07 · unverdicted · novelty 7.0

MaskAlign uses random token-subset alignment and pre-mask mixing to reduce diffusion models' reliance on complete clean-image token sets during representation alignment.

Where the Score Lives: A Wavelet View of Diffusion

cs.LG · 2026-06-06 · unverdicted · novelty 7.0

Derives optimal score functions for diffusion models as wavelet expansions in terms of data moments, enabling architecture-agnostic analysis of which distribution attributes matter for denoising.

Consistent-Inversion: Reverse Consistency Guidance for Structure-Preserving Visual Editing

cs.CV · 2026-06-05 · unverdicted · novelty 7.0

Consistent-Inversion introduces reverse consistency guidance that corrects early target denoising steps by checking reversibility toward the source inversion trajectory under the original prompt.

Parallel Jacobi Decoding for Fast Autoregressive Image Generation

cs.CV · 2026-06-04 · conditional · novelty 7.0

Parallel Jacobi Decoding accelerates autoregressive image models 4.8x-6.4x by using 2D spatial draft expansion and adjusted attention masks while keeping generation quality competitive.

Reflection Separation from a Single Image via Joint Latent Diffusion

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A joint latent diffusion model with cross-layer self-attention and disjoint sampling separates reflection and transmission layers from single images more effectively than prior methods on real-world benchmarks.

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.

Splatshot: 3D Face Avatar Generation from a Single Unconstrained Photo

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

SplatShot is a training-free method that inserts per-step 3DGS refitting and photometric feedback into diffusion denoising to enforce multi-view consistency for single-photo 3D face avatars.

Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

DRDD decouples diffusion into independent noise and residual stages to preserve domain harmonization and enable unified data-efficient I2I translation.

Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance

cs.RO · 2026-05-28 · unverdicted · novelty 7.0

CGPO integrates training-free critic guidance into diffusion denoising to produce high-Q actions as regression targets, yielding SOTA results on MuJoCo locomotion and successful Franka arm grasping.

Midpoint Generative Models

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Midpoint Generative Models define a midpoint divergence from flow matching symmetry and derive its variational form as a tractable objective for training competitive one-step generators.

Spectral Guidance for Flexible and Efficient Control of Diffusion Models

cs.LG · 2026-05-27 · unverdicted · novelty 7.0

Spectral Guidance learns singular functions via self-supervised objective to project guidance signals onto diffusion sampling trajectories, enabling stable control without retraining or backpropagation and improving CIFAR-10 accuracy by 37 points with 4x faster sampling.

Towards Anatomically Plausible Human Image Generation via Synthetic Localized Preferences

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

ASAP generates over 10K synthetic anatomical preference pairs via targeted degradation of high-fidelity images and applies a localized margin-bounded DPO to reduce anatomical errors in text-to-image human generation, supported by the new HAP dataset and HAF-Bench.

DeltaCam: Differential Intrinsic Camera Modeling for Video Generation

cs.CV · 2026-05-24 · unverdicted · novelty 7.0

DeltaCam models relative changes in camera intrinsics via Δ-parameterized neural adaptors in video diffusion models trained on synthetic data to enable controllable generation and real-world transfer.

Loki: Representation over Architecture for Diffusion-Based Portrait Animation

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

Loki replaces RGB conditioning stacks with identity-orthogonal parametric face encodings rasterized for diffusion, achieving efficient cross-ID portrait animation without cross-ID training data.

citing papers explorer

Showing 26 of 26 citing papers after filters.

Generating Attribution Reports for Manipulated Facial Images: A Dataset and Baseline cs.CV · 2024-12-27 · unverdicted · none · ref 34 · internal anchor
Introduces the MMTT dataset of 152k manipulated facial images with masks and text descriptions, plus the ForgeryTalker model that jointly outputs localization masks and explanatory text, reporting 59.3 CIDEr and 73.67 IoU.
Repurposing Image Diffusion Models for Training-Free Music Style Transfer on Mel-spectrograms cs.SD · 2024-11-24 · conditional · none · ref 21 · internal anchor
Stylus achieves training-free music style transfer on Mel-spectrograms by repurposing image diffusion models via style-key injection in self-attention plus phase-preserving reconstruction, outperforming baselines by 34.1% in content preservation and 25.7% in perceptual quality per 2,925 human raters.
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation cs.CV · 2024-10-17 · unverdicted · none · ref 71 · internal anchor
Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
One Step Diffusion via Shortcut Models cs.LG · 2024-10-16 · conditional · none · ref 23 · internal anchor
Shortcut models enable high-quality single or few-step sampling in diffusion models with one network and training phase by conditioning on desired step size.
Diffusion Models Are Real-Time Game Engines cs.LG · 2024-08-27 · conditional · none · ref 91 · internal anchor
A diffusion model trained on DOOM play sessions generates stable real-time interactive game frames at 20 FPS with quality near lossy JPEG.
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation cs.CV · 2024-06-10 · conditional · none · ref 30 · internal anchor
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
OmniPrism: Learning Disentangled Visual Concept for Image Generation cs.CV · 2024-12-16 · unverdicted · none · ref 36 · internal anchor
OmniPrism proposes a disentanglement method using a new paired dataset (PCD-200K), COD contrastive training, and block embeddings to inject separated concepts into diffusion models for multi-aspect image generation.
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation cs.RO · 2024-11-29 · unverdicted · none · ref 59 · internal anchor
CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new robots and objects.
DGSNA: Dynamic Generative Scene-based Noise Addition method cs.SD · 2024-11-19 · unverdicted · none · ref 61 · internal anchor
DGSNA dynamically generates scene-specific noise via prompt-driven language models and text-to-audio diffusion, then mixes it with speech to improve recognition and keyword spotting robustness by up to 11.32%.
Conjuring Semantic Similarity cs.AI · 2024-10-21 · unverdicted · none · ref 23 · internal anchor
Semantic similarity between texts is measured by the Jeffreys divergence between the image distributions induced by conditioning a diffusion model on each text, computed via Monte-Carlo sampling of the reverse-time SDEs.
Diffusion Policy Policy Optimization cs.RO · 2024-09-01 · unverdicted · none · ref 87 · internal anchor
DPPO fine-tunes diffusion policies via policy gradients and outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.
VideoPhy: Evaluating Physical Commonsense for Video Generation cs.CV · 2024-06-05 · conditional · none · ref 93 · internal anchor
VideoPhy benchmark shows state-of-the-art text-to-video models follow physical commonsense and text prompts in only 39.6% of cases for the best model.
PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference cs.CV · 2024-05-23 · unverdicted · none · ref 17 · internal anchor
PipeFusion applies patch partitioning and pipeline parallelism with one-step stale feature reuse to reduce communication overhead in DiT inference, reporting SOTA results on 8x L40 GPUs for Pixart, SD3, and Flux.1.
CAT3D: Create Anything in 3D with Multi-View Diffusion Models cs.CV · 2024-05-16 · conditional · none · ref 86 · internal anchor
A multi-view diffusion model generates consistent novel views from sparse images to enable fast 3D scene reconstruction.
CameraCtrl: Enabling Camera Control for Text-to-Video Generation cs.CV · 2024-04-02 · unverdicted · none · ref 149 · internal anchor
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
3D Diffuser Actor: Policy Diffusion with 3D Scene Representations cs.RO · 2024-02-16 · conditional · none · ref 66 · internal anchor
3D Diffuser Actor unifies diffusion policies with 3D scene features to set new state-of-the-art results on RLBench and CALVIN robot benchmarks.
Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation cs.RO · 2024-01-04 · conditional · none · ref 85 · internal anchor
A low-cost whole-body teleoperation system enables effective imitation learning for complex bimanual mobile manipulation by co-training on mobile and static demonstration datasets.
SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation cs.CV · 2024-11-28 · unverdicted · none · ref 15 · internal anchor
SOW uses MLLMs and attention to selectively control unidirectional diffusion for pixel-level fidelity and contextual coherence in text-vision-to-image tasks.
KFC-W: Generating 3D-Consistent Videos from Unposed Internet Photos cs.CV · 2024-11-20 · unverdicted · none · ref 62 · internal anchor
KFC-W is a self-supervised 3D-aware video model trained on videos and multiview internet photos that produces geometrically consistent interpolations between unposed input images without any 3D annotations.
Movie Gen: A Cast of Media Foundation Models cs.CV · 2024-10-17 · unverdicted · none · ref 64 · internal anchor
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
Diffusion Models are Evolutionary Algorithms cs.NE · 2024-10-03 · unverdicted · none · ref 10 · internal anchor
Diffusion models are evolutionary algorithms via a denoising-evolution equivalence, yielding Diffusion Evolution that outperforms mainstream EAs on multi-optima tasks.
A Survey on Diffusion Models for Inverse Problems cs.LG · 2024-09-30 · unverdicted · none · ref 132 · internal anchor
A survey that introduces taxonomies for categorizing pre-trained diffusion model methods applied to inverse problems and analyzes their connections and challenges.
Pose-dIVE: Pose-Diversified Augmentation with Diffusion Model for Person Re-Identification cs.CV · 2024-06-23 · unverdicted · none · ref 51 · internal anchor
Pose-dIVE augments Re-ID training sets with diffusion-generated images of diverse poses and viewpoints by conditioning on SMPL parameters.
Open-Sora Plan: Open-Source Large Video Generation Model cs.CV · 2024-11-28 · unverdicted · none · ref 17 · internal anchor
Open-Sora Plan presents an open-source large video generation model that combines a Wavelet-Flow VAE, Joint Image-Video Skiparse Denoiser, and multi-dimensional data curation to achieve high-quality video outputs with public code and weights.
MSG Score: Automated Video Verification for Reliable Multi-Scene Generation cs.CV · 2024-11-28 · unverdicted · none · ref 2 · internal anchor
Proposes MSG score as core of CGS framework plus IID distillation for automated, fast verification of long-form text-to-video outputs.
Flemme: A Flexible and Modular Learning Platform for Medical Images eess.IV · 2024-08-18 · unverdicted · none · ref 15 · internal anchor
Flemme is a modular platform separating encoders (conv/transformer/SSM) from encoder-decoder architectures for medical images, with a hierarchical pyramid loss yielding reported average gains of 5.6% Dice and 5.57% PSNR.
