Denoising Diffusion Implicit Models
Mixed citation behavior. Most common role is background (67%).
abstract
Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples $10 \times$ to $50 \times$ faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.
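The abstract describes the accelerated sampler only in words. A minimal NumPy sketch of the update rule from the DDIM paper follows, assuming a noise-prediction model and a linear beta schedule. The names `ddim_step`, `ddim_sample`, `eps_model`, and `abar` are illustrative and not from the page above; `abar` is the cumulative product of (1 - beta), which the paper itself writes as alpha. The toy "oracle" predictor stands in for a trained network so the example runs on its own.

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev, eta=0.0, rng=None):
    """One reverse step from the current timestep to an earlier one.

    abar_t / abar_prev are cumulative products of (1 - beta) at the current
    and target timesteps (abar_prev = 1.0 means "fully denoised").
    eta = 0 gives the deterministic DDIM update; eta = 1 adds DDPM-like noise.
    """
    # Clean sample implied by the current noise estimate.
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Standard deviation of the injected noise (zero when eta == 0).
    sigma = eta * np.sqrt((1.0 - abar_prev) / (1.0 - abar_t)) \
                * np.sqrt(1.0 - abar_t / abar_prev)
    # Deterministic component that points back toward x_t.
    dir_xt = np.sqrt(np.maximum(1.0 - abar_prev - sigma**2, 0.0)) * eps_pred
    noise = 0.0
    if eta > 0:
        if rng is None:
            rng = np.random.default_rng()
        noise = sigma * rng.standard_normal(x_t.shape)
    return np.sqrt(abar_prev) * x0_pred + dir_xt + noise

def ddim_sample(eps_model, x_T, abar, num_steps, eta=0.0, rng=None):
    """Sample using only `num_steps` of the len(abar) training timesteps."""
    T = len(abar)
    # Evenly spaced, decreasing subsequence of the training timesteps.
    steps = np.linspace(T - 1, 0, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(steps):
        abar_prev = abar[steps[i + 1]] if i + 1 < len(steps) else 1.0
        x = ddim_step(x, eps_model(x, t), abar[t], abar_prev, eta, rng)
    return x

# Toy check with an oracle predictor that always returns the true noise.
betas = np.linspace(1e-4, 0.02, 1000)        # assumed linear schedule
abar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
x_T = np.sqrt(abar[-1]) * x0 + np.sqrt(1.0 - abar[-1]) * eps
x_rec = ddim_sample(lambda x, t: eps, x_T, abar, num_steps=20)
```

With `eta=0.0` the trajectory is fully determined by `x_T`, so here 20 steps instead of 1000 still recover `x0` up to floating-point error. With a learned predictor, fewer steps trade sample quality for speed, as the abstract notes.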
representative citing papers
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
SplatShot is a training-free method that inserts per-step 3DGS refitting and photometric feedback into diffusion denoising to enforce multi-view consistency for single-photo 3D face avatars.
DRDD decouples diffusion into independent noise and residual stages to preserve domain harmonization and enable unified data-efficient I2I translation.
CGPO integrates training-free critic guidance into diffusion denoising to produce high-Q actions as regression targets, yielding SOTA results on MuJoCo locomotion and successful Franka arm grasping.
Midpoint Generative Models define a midpoint divergence from flow matching symmetry and derive its variational form as a tractable objective for training competitive one-step generators.
Spectral Guidance learns singular functions via a self-supervised objective and uses them to project guidance signals onto diffusion sampling trajectories, enabling stable control without retraining or backpropagation and improving CIFAR-10 accuracy by 37 points with 4x faster sampling.
JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
DFSAttn is a training-free framework for dynamic fine-grained sparse attention in video DiTs that achieves up to 2.1x speedup while preserving generation quality via Hilbert reordering, hierarchical scoring, and adaptive caching.
VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.
Linear-DPO replaces sigmoid utility with linear utility and adds EMA reference to improve preference alignment in diffusion and flow-matching text-to-image models.
DrawMotion is a diffusion-based framework that fuses text and hand-drawn stickman conditions via a Multi-Condition Module and training-free guidance to generate 3D human motions.
CAdam reinterprets densification in generative 3DGS as signal verification via gradient-moment interference, quantile context, and SNR gating to achieve large reductions in primitive count with comparable quality.
A hypernetwork generates complete task-specific visuomotor policy parameters from instructions alone to structurally eliminate observation leakage in language-conditioned robotic control.
BrepForge factorizes B-rep synthesis into face-aware autoregressive wireframe composition followed by boundary-conditioned surface instantiation using learning-free geometric priors.
IPR improves valid solution rates on MNIST Sudoku from 55.8% to 75.0% by iteratively refining partial regions in sequential diffusion models without external verifiers or reward models.
PolycubeNet applies a dual-latent diffusion architecture to generate polycube point clouds from input point clouds, enabling robust hexahedral mesh creation without surface segmentation or templates.
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
DiffIML applies score-based generative modeling to image manipulation localization, recovering coherent masks iteratively from noise to improve generalization on unseen manipulation types.
VMU-Diff improves precipitation nowcasting via coarse multi-source Vision Mamba fusion followed by residual conditional diffusion refinement.
HASTE delivers up to 1.93x speedup on Wan2.1 video DiTs via head-wise adaptive sparse attention using temporal mask reuse and error-guided per-head calibration while preserving video quality.
Introduces the task of counterfactual time series forecasting with textual conditions plus a text-attribution mechanism that improves accuracy by distinguishing mutable from immutable factors.
citing papers explorer
- ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos
  ActivityForensics is the first large-scale benchmark for temporally localizing activity-level forgeries in videos, paired with a diffusion-based baseline called TADiff.
- Flow-GRPO: Training Flow Matching Models via Online RL
  Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
- Splatshot: 3D Face Avatar Generation from a Single Unconstrained Photo
  SplatShot is a training-free method that inserts per-step 3DGS refitting and photometric feedback into diffusion denoising to enforce multi-view consistency for single-photo 3D face avatars.
- Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation
  DRDD decouples diffusion into independent noise and residual stages to preserve domain harmonization and enable unified data-efficient I2I translation.
- DFSAttn: Dynamic Fine-grained Sparse Attention for Efficient Video Generation
  DFSAttn is a training-free framework for dynamic fine-grained sparse attention in video DiTs that achieves up to 2.1x speedup while preserving generation quality via Hilbert reordering, hierarchical scoring, and adaptive caching.
- VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation
  VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.
- Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models
  Linear-DPO replaces sigmoid utility with linear utility and adds EMA reference to improve preference alignment in diffusion and flow-matching text-to-image models.
- DrawMotion: Generating 3D Human Motions by Freehand Drawing
  DrawMotion is a diffusion-based framework that fuses text and hand-drawn stickman conditions via a Multi-Condition Module and training-free guidance to generate 3D human motions.
- Functionalization via Structure Completion and Motion Rectification
  Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
- StreamingEffect: Real-Time Human-Centric Video Effect Generation
  StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
- Towards Generalized Image Manipulation Localization via Score-based Model
  DiffIML applies score-based generative modeling to image manipulation localization, recovering coherent masks iteratively from noise to improve generalization on unseen manipulation types.
- VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting
  VMU-Diff improves precipitation nowcasting via coarse multi-source Vision Mamba fusion followed by residual conditional diffusion refinement.
- HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention
  HASTE delivers up to 1.93x speedup on Wan2.1 video DiTs via head-wise adaptive sparse attention using temporal mask reuse and error-guided per-head calibration while preserving video quality.
- HIR-ALIGN: Enhancing Hyperspectral Image Restoration via Diffusion-Based Data Generation
  HIR-ALIGN augments limited target data for hyperspectral restoration by creating proxy clean images, synthesizing aligned HSIs with blur-robust diffusion and warp-based transfer, then finetuning models to lower target-domain risk.
- Stylized Text-to-Motion Generation via Hypernetwork-Driven Low-Rank Adaptation
  A hypernetwork maps style motion embeddings to LoRA updates that stylize text-driven motion diffusion models with improved generalization to unseen styles via contrastive structuring of the style space.
- Amortized Guidance for Image Inpainting with Pretrained Diffusion Models
  AID amortizes guidance for diffusion inpainting by training a reusable module via an auxiliary Gaussian formulation and continuous-time actor-critic algorithm, improving quality-speed trade-off with under 1% overhead.
- ImageAttributionBench: How Far Are We from Generalizable Attribution?
  ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.
- DirectTryOn: One-Step Virtual Try-On via Straightened Conditional Transport
  DirectTryOn achieves state-of-the-art one-step virtual try-on performance by applying pure conditional transport, garment preservation loss, and self-consistency loss to straighten trajectories in pretrained generative models.
- LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR
  LatentHDR generates structurally consistent panoramic HDR images by producing one scene latent with a diffusion backbone then deterministically mapping it to multiple exposure latents via a lightweight conditional head.
- Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
  PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
- OphEdit: Training-Free Text-Guided Editing of Ophthalmic Surgical Videos
  OphEdit enables text-guided editing of eye surgery videos without training by injecting preserved attention value tensors into the diffusion denoising process to maintain anatomical structure.
- GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
  GPO-V jailbreaks dVLMs by globally optimizing probabilities in the denoising process to bypass refusal patterns, achieving stealthy and transferable attacks.
- LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling
  LENS shapes low-frequency eigen noise with a lightweight network to enable efficient, high-quality sampling in distilled diffusion models.
- MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery
  MotionGRPO models diffusion sampling as a Markov decision process optimized with Group Relative Policy Optimization, using hybrid rewards and noise injection to boost sample diversity and local joint precision in egocentric motion recovery.
- DMGD: Train-Free Dataset Distillation with Semantic-Distribution Matching in Diffusion Models
  DMGD achieves better performance than fine-tuned SOTA methods in dataset distillation on ImageNet subsets by using semantic matching through conditional likelihood optimization and OT-based distribution matching in a training-free diffusion setup.
- SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking
  SpecEdit accelerates diffusion-based image editing up to 10x by using a low-resolution draft to identify edit-relevant tokens via semantic discrepancies for selective high-resolution denoising.
- Noise2Map: End-to-End Diffusion Model for Semantic Segmentation and Change Detection
  Noise2Map repurposes diffusion model denoising into a direct predictor for semantic segmentation and change detection tasks in remote sensing, achieving top average ranks on benchmark datasets.
- SEAL: Semantic-aware Single-image Sticker Personalization with a Large-scale Sticker-tag Dataset
  SEAL introduces semantic-guided constraints during test-time adaptation to improve identity preservation and contextual control in single-image sticker personalization, backed by a new large-scale tagged sticker dataset.
- ResetEdit: Precise Text-guided Editing of Generated Image via Resettable Starting Latent
  ResetEdit embeds a recoverable discrepancy signal during image generation in diffusion models to reconstruct an approximate original latent for high-fidelity text-guided editing.
- Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization
  Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
- $Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models
  Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a directional derivative penalty.
- ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
  ReImagine decouples human appearance from temporal consistency via pretrained image backbones, SMPL-X motion guidance, and training-free video diffusion refinement to generate high-quality controllable videos.
- HP-Edit: A Human-Preference Post-Training Framework for Image Editing
  HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.
- Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation
  OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.
- Denoise and Align: Diffusion-Driven Foreground Knowledge Prompting for Open-Vocabulary Temporal Action Detection
  DFAlign uses diffusion-based denoising to generate foreground knowledge prompts that improve cross-modal alignment for detecting unseen actions in untrimmed videos, reporting state-of-the-art results on OV-TAD benchmarks.
- Long-Text-to-Image Generation via Compositional Prompt Decomposition
  PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models while generalizing better to prompts over 500 tokens.
- View-Consistent 3D Scene Editing via Dual-Path Structural Correspondense and Semantic Continuity
  A dual-path consistency framework for text-driven 3D scene editing that models cross-view dependencies via structural correspondence and semantic continuity, trained on a newly constructed paired multi-view dataset.
- Structure-Adaptive Sparse Diffusion in Voxel Space for 3D Medical Image Enhancement
  A sparse voxel-space diffusion method with structure-adaptive modulation achieves up to 10x training speedup and state-of-the-art results for 3D medical image denoising and super-resolution.
- From Competition to Coopetition: Coopetitive Training-Free Image Editing Based on Text Guidance
  CoEdit is a zero-shot coopetitive framework for text-guided image editing that uses dual-entropy attention manipulation and entropic latent refinement to improve editing harmony and structural preservation.
- High-Speed Full-Color HDR Imaging via Unwrapping Modulo-Encoded Spike Streams
  An exposure-decoupled modulo formulation and iteration-free diffusion-prior unwrapping enable 1000 FPS full-color HDR imaging on spike cameras while cutting bandwidth from 20 Gbps to 6 Gbps.
- AbdomenGen: Sequential Volume-Conditioned Diffusion Framework for Abdominal Anatomy Generation
  A sequential diffusion framework generates controllable abdominal anatomies with a Volume Control Scalar that decouples organ size from body habitus, achieving Dice scores around 0.83 and reducing distributional mismatch by 73.6% in a hepatomegaly example.
- MAST: Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer
  MAST is a mask-guided attention allocation method that enables artifact-free multi-style transfer in diffusion models by anchoring layout, distributing attention mass, scaling sharpness, and injecting details.
- Your Pre-trained Diffusion Model Secretly Knows Restoration
  Pre-trained diffusion models inherently support image restoration that can be unlocked by optimizing prompt embeddings at the text encoder output using a diffusion bridge formulation, achieving competitive results on models like WAN and FLUX without fine-tuning.
- InfBaGel: Human-Object-Scene Interaction Generation with Dynamic Perception and Iterative Refinement
  InfBaGel generates consistent human-object-scene interactions via dynamic perception during iterative refinement in a consistency model, bump-aware guidance to avoid collisions, and hybrid training that mixes synthesized pseudo-samples with real HSI data.
- Bridging Restoration and Diagnosis: A Comprehensive Benchmark for Retinal Fundus Enhancement
  EyeBench-V2 is a new benchmark that evaluates retinal fundus enhancement models using downstream clinical tasks, generalization tests, and structured expert assessments to measure real diagnostic utility.
- GVCC: Zero-Shot Video Compression via Codebook-Driven Stochastic Rectified Flow
  GVCC achieves the lowest LPIPS on UVG at bitrates down to 0.003 bpp by encoding stochastic innovations in a marginal-preserving stochastic process derived from a pretrained rectified-flow video model, with 65% LPIPS reduction over DCVC-RT.
- LPNSR: Optimal Noise-Guided Diffusion Image Super-Resolution Via Learnable Noise Prediction
  LPNSR derives optimal intermediate noise for diffusion SR via MLE and implements it with an LR-guided noise predictor, reaching SOTA perceptual quality in 4 steps without text priors.
- Setting-Matched and Semantics-Scaled Benchmarking of One-Step Generative Models Against Multistep Diffusion and Flow Models
  Matched benchmarking reveals FID misleads in few-step regimes under CFG, prompting CLIP-scaled and PickScore-scaled FID and IS variants for better semantic evaluation of one-step image generators.
- MultiAnimate: Pose-Guided Image Animation Made Extensible
  MultiAnimate adds Identifier Assigner and Identifier Adapter modules to diffusion video models so they can handle multiple characters without identity mix-ups, generalizing from two-character training data to more characters.
- Contour Refinement using Discrete Diffusion in Low Data Regime
  A CNN-based discrete diffusion method refines sparse contours from segmentation masks using simplified denoising steps and minimal post-processing, outperforming baselines on small medical and environmental datasets while running 3.5 times faster.