Lip Forcing distills a 14B bidirectional video diffusion teacher into autoregressive students that achieve real-time lip synchronization at 31 FPS using two denoising steps without CFG.
super hub Mixed citations
Denoising Diffusion Implicit Models
Mixed citation behavior. Most common role is background (67%).
abstract
Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples $10 \times$ to $50 \times$ faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose revers
authors
co-cited works
representative citing papers
TAKO demonstrates real-time adversarial takeover of robotic diffusion policies via reusable universal patches on visual inputs, achieving 100% success in steering attacker-chosen trajectories across multiple tasks, encoders, and diffusion methods.
ActivityForensics is the first large-scale benchmark for temporally localizing activity-level forgeries in videos, paired with a diffusion-based baseline called TADiff.
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
Flow-Map GRPO uses anchored stochastic flow map composition to enable GRPO-based RL alignment of deterministic few-step flow-map generators while preserving their marginal paths.
Introduces a Bridge latent interface that maps mismatched student latents into teacher space, enabling distillation from modern diffusion teachers to compact one-step students and raising SD 1.5 HPSv3 from 5.4 to 9.4 while keeping one-step speed.
MUSE shows that the native timestep embedding in diffusion models acts as a parameter-free steering signal for multi-task monocular depth and normal estimation via manifold decoupling in latent space.
Introduces the ASTAD task and training-free ASTModel framework for semantically consistent asymmetric style transfer using labeled synthetic content and unlabeled real references.
SDS extracts stable spectral signatures from diffusion model denoisers via frequency-controlled perturbations, achieving 99.9% attribution accuracy across eight models and 96.2% under prompt shift.
MaskAlign uses random token-subset alignment and pre-mask mixing to reduce diffusion models' reliance on complete clean-image token sets during representation alignment.
Derives optimal score functions for diffusion models as wavelet expansions in terms of data moments, enabling architecture-agnostic analysis of which distribution attributes matter for denoising.
Consistent-Inversion introduces reverse consistency guidance that corrects early target denoising steps by checking reversibility toward the source inversion trajectory under the original prompt.
Parallel Jacobi Decoding accelerates autoregressive image models 4.8x-6.4x by using 2D spatial draft expansion and adjusted attention masks while keeping generation quality competitive.
A joint latent diffusion model with cross-layer self-attention and disjoint sampling separates reflection and transmission layers from single images more effectively than prior methods on real-world benchmarks.
A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.
SplatShot is a training-free method that inserts per-step 3DGS refitting and photometric feedback into diffusion denoising to enforce multi-view consistency for single-photo 3D face avatars.
DRDD decouples diffusion into independent noise and residual stages to preserve domain harmonization and enable unified data-efficient I2I translation.
CGPO integrates training-free critic guidance into diffusion denoising to produce high-Q actions as regression targets, yielding SOTA results on MuJoCo locomotion and successful Franka arm grasping.
Midpoint Generative Models define a midpoint divergence from flow matching symmetry and derive its variational form as a tractable objective for training competitive one-step generators.
Spectral Guidance learns singular functions via self-supervised objective to project guidance signals onto diffusion sampling trajectories, enabling stable control without retraining or backpropagation and improving CIFAR-10 accuracy by 37 points with 4x faster sampling.
ASAP generates over 10K synthetic anatomical preference pairs via targeted degradation of high-fidelity images and applies a localized margin-bounded DPO to reduce anatomical errors in text-to-image human generation, supported by the new HAP dataset and HAF-Bench.
DeltaCam models relative changes in camera intrinsics via Δ-parameterized neural adaptors in video diffusion models trained on synthetic data to enable controllable generation and real-world transfer.
Loki replaces RGB conditioning stacks with identity-orthogonal parametric face encodings rasterized for diffusion, achieving efficient cross-ID portrait animation without cross-ID training data.
citing papers explorer
-
Flow-GRPO: Training Flow Matching Models via Online RL
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
-
Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion
Stream-DiffVSR enables practical low-latency video super-resolution by combining a four-step distilled denoiser, auto-regressive temporal guidance, and a temporal processor in a strictly causal pipeline.
-
Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution
IAFS is a training-free iterative inference-time scaling framework that uses adaptive frequency-aware particle fusion to resolve the perception-fidelity conflict in diffusion super-resolution models, outperforming prior scaling strategies.
-
GLUE: Coordinating Pre-Trained Generative Models for System-Level Design
GLUE orchestrates frozen pre-trained generative models into a system-level design generator that enforces feasibility, performance, and diversity, with data-driven and data-free variants benchmarked on UAV design.
-
LangDriveCTRL: Natural Language Controllable Driving Scene Editing with Multi-modal Agents
LangDriveCTRL decomposes driving videos into 3D scene graphs and uses an agentic pipeline with specialized multi-modal agents to perform language-controlled object and behavior edits, achieving nearly 2x higher instruction alignment than prior state-of-the-art methods.
-
Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
Omni-Attribute is a new open-vocabulary image attribute encoder trained on semantically linked pairs with dual objectives to produce disentangled representations for personalization and compositional generation.
-
RDSplat: Robust Watermarking for 3D Gaussian Splatting Against 2D and 3D Diffusion Editing
RDSplat is the first 3D Gaussian Splatting watermarking method that maintains 0.701 bit accuracy against both 2D and 3D diffusion editing by embedding only in low-frequency primitives selected via FAPS.
-
Multimodal Diffusion Forcing for Forceful Manipulation
Multimodal Diffusion Forcing trains a diffusion model on partially masked multimodal robot trajectories to learn temporal and cross-modal dependencies for forceful manipulation.
-
Noise Aggregation Analysis Driven by Small-Noise Injection: Efficient Membership Inference for Diffusion Models
Introduces noise aggregation analysis with single-step small-noise injection to enable efficient and accurate membership inference attacks on diffusion models.
-
Exploring Cross-Modal Flows for Few-Shot Learning
FMA introduces flow matching for multi-step cross-modal feature alignment in few-shot learning, using fixed coupling, noise augmentation, and early-stopping to outperform one-step PEFT methods.
-
Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner
CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.
-
Diffusion and Flow-based Copulas: Forgetting and Remembering Dependencies
Diffusion and flow processes forget dependencies to define valid copulas then learn to remember them for density estimation and sampling, outperforming prior copula methods on complex datasets.
-
DiffusionNFT: Online Diffusion Reinforcement with Forward Process
DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-process methods.
-
FluentAvatar: Flicker-Free Talking-Head Animation via Phoneme-Guided Autoregressive Modeling
Phoneme-guided autoregressive framework for talking-head animation that reduces inter-frame flicker via causal keyframe generation and timestamp-aware interpolation, outperforming diffusion baselines on FVD and a new BG-Flicker metric.
-
Patient-Adaptive Echocardiography using Cognitive Ultrasound
A temporal diffusion model enables adaptive selection of focused ultrasound transmits, outperforming random subsampling and diverging waves on EchoNet-Dynamic and in-house echocardiogram datasets while supporting real-time operation.
-
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.
-
MAGIC: Few-Shot Mask-Guided Anomaly Inpainting with Prompt Perturbation, Spatially Adaptive Guidance, and Context Awareness
MAGIC is a few-shot mask-guided anomaly inpainting framework using Gaussian prompt perturbation, spatially adaptive guidance, and context-aware mask alignment to produce high-fidelity, diverse anomalies that outperform prior methods on downstream detection tasks.
-
Doloris: Dual Conditional Diffusion Implicit Bridges with Sparsity Masking Strategy for Unpaired Single-Cell Perturbation Estimation
Doloris introduces dual conditional diffusion implicit bridges plus a sparsity masking strategy to model unpaired single-cell perturbation responses and reports state-of-the-art results on public datasets.
-
GenHSI: Controllable Generation of Human-Scene Interaction Videos
GenHSI is a training-free three-stage pipeline that turns a scene image, character image, and complex HSI prompt into long videos with plausible chained interactions by generating atomic actions, 3D keyframes via 2D inpainting plus optimization, and then feeding them to pre-trained video diffusion.
-
Beyond Blur: A Fluid Perspective on Generative Diffusion Models
Proposes an advection-diffusion PDE corruption process with stochastic velocity fields and Lattice Boltzmann solver for diffusion models, generalizing prior PDE methods.
-
Steering Your Diffusion Policy with Latent Space Reinforcement Learning
DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.
-
FD-Bench: A Modular and Fair Benchmark for Data-driven Fluid Simulation
FD-Bench supplies the first modular, reproducible benchmark and leaderboard for comparing neural PDE solvers on fluid dynamics tasks with direct numerical solver baselines.
-
COCO-Inpaint: A Benchmark for Detecting and Localizing Inpainting-Based Image Manipulations
COCO-Inpaint supplies a large-scale dataset and evaluation protocol focused on inpainting-based image forgeries to benchmark existing detection methods.
-
UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models
UniEdit-Flow presents tuning-free Uni-Inv and Uni-Edit methods for inversion and editing in flow models that achieve accurate reconstruction and robust region-preserving edits across generative models.
-
Gungnir: Exploiting Stylistic Features in Images for Backdoor Attacks on Diffusion Models
Gungnir shows that style-based triggers with RAN and STTR techniques can activate backdoors in diffusion models while evading detection and surviving fine-tuning.
-
History-Guided Video Diffusion
DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.
-
Flexible Multitask Learning with Factorized Diffusion Policy
A factorized modular diffusion policy improves fitting of multimodal robot actions and enables flexible task adaptation without catastrophic forgetting.
-
Driving in Corner Case: A Real-World Adversarial Closed-Loop Evaluation Platform for End-to-End Autonomous Driving
A platform using flow matching for real-world image generation and an adversarial policy creates challenging corner cases to evaluate end-to-end autonomous driving models like UniAD and VAD, showing performance degradation.
-
Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model
Lotus-2 is a two-stage deterministic adaptation of diffusion priors that achieves state-of-the-art monocular depth estimation with only 59K training samples.
-
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
DeCo decouples high- and low-frequency generation in pixel diffusion via a DiT plus lightweight decoder and a frequency-aware flow-matching loss, reaching FID 1.62 at 256x256 and 2.22 at 512x512 on ImageNet while closing the gap to latent diffusion methods.
-
Eevee: Towards Close-up High-resolution Video-based Virtual Try-on
A new dataset with high-fidelity close-up garment images and full/close-up try-on videos plus the VGID metric enables better texture and structure preservation in high-resolution video virtual try-on.
-
Seeing What Matters: Visual Preference Policy Optimization for Visual Generation
ViPO enhances GRPO for visual generation by creating spatially and temporally aware advantage maps from pretrained vision models to focus optimization on perceptually important regions.
-
SPAGS: Sparse-View Articulated Object Reconstruction from Single State via Planar Gaussian Splatting
SPAGS reconstructs articulated objects from sparse single-state RGB images by constraining Gaussians to planar primitives, optimizing with depth and diffusion priors, and using a VLM for part segmentation and joint estimation.
-
Efficient Score Pre-computation for Diffusion Models via Cross-Matrix Krylov Projection
Cross-matrix Krylov projection reuses shared subspaces from seed matrices to accelerate score pre-computation in diffusion models, delivering 15.8-43.7% time savings and up to 115x speedup versus DDPM baselines.
-
UniSER: A Foundation Model for Unified Soft Effects Removal
UniSER is a unified diffusion transformer foundation model that removes diverse soft image degradations by training on a large curated dataset of semi-transparent occlusions with fine-grained controls.
-
Hierarchical Schedule Optimization for Fast and Robust Diffusion Model Sampling
HSO is a bi-level optimization method with Midpoint Error Proxy and Spacing-Penalized Fitness that finds robust timestep schedules for low-NFE diffusion sampling and reports SOTA FID scores such as 11.94 at NFE=5.
-
Emu3.5: Native Multimodal Models are World Learners
Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.
-
RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.
-
Control-Augmented Autoregressive Diffusion for Data Assimilation
An offline-trained controller augments autoregressive diffusion models to perform fast, feed-forward data assimilation in chaotic spatiotemporal PDEs with order-of-magnitude speedups and improved accuracy over baselines.
-
Locate-Then-Examine: Grounded Region Reasoning Improves Detection of AI-Generated Images
Locate-Then-Examine improves AI-generated image detection by localizing suspicious regions first then performing region-aware re-examination, while releasing the TRACE dataset of 20k annotated images.
-
Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
Rolling Forcing generates multi-minute videos in real time by jointly denoising frames at increasing noise levels, anchoring attention to early frames, and using windowed distillation to limit error accumulation.
-
Sample-Efficient Optimisation over the Outputs of Generative Models
O3 uses surrogate latent spaces extracted from generative models to perform sample-efficient black-box optimization over their outputs, outperforming direct sampling and original-latent optimization on image and protein tasks.
-
Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling
Dynamic-TreeRPO replaces independent trajectory sampling with a tree-structured search using dynamic noise intensities and integrates SFT into RL via a weighted Progress Reward Model to achieve better semantic consistency and efficiency in text-to-image generation.
-
FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing
FlashEdit delivers real-time localized text-guided image editing under 0.2 seconds via cycle-consistent one-step inversion, background shield, and sparsified spatial cross-attention, achieving over 150x speedup on PIE-Bench.
-
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
MemoryVLA introduces a perceptual-cognitive memory bank and working-memory retrieval mechanism into VLA models, raising success rates on long-horizon robotic tasks by up to 26 points over prior baselines.
-
HERO: Hierarchical Extrapolation and Refresh for Efficient World Models
HERO accelerates world model inference 1.73x via hierarchical patch-wise refresh in shallow layers and linear extrapolation in deeper layers with minimal quality loss.
-
IntrinsicWeather: Controllable Weather Editing in Intrinsic Space
A diffusion framework decomposes images into intrinsic maps via an inverse renderer and renders controllable weather changes via a forward renderer with CLIP prompt interpolation and map-aware attention, outperforming pixel-space baselines on new 38k synthetic and 18k real datasets.
-
Synthetic Data Augmentation for Enhanced Chicken Carcass Instance Segmentation
Synthetic data augmentation improves instance segmentation performance for chicken carcasses when real annotated data is limited.
-
SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models
SmokeSVD reconstructs dynamic smoke from a single video via diffusion-based side-view synthesis, progressive multi-view refinement, and Navier-Stokes-guided density-velocity estimation.
-
Stein Diffusion Guidance: Training-Free Posterior Correction for Sampling Beyond High-Density Regions
Stein Diffusion Guidance corrects approximate posteriors in diffusion sampling via a Stein variational mechanism and surrogate SOC objective to enable effective guidance beyond high-density regimes.