ActivityForensics is the first large-scale benchmark for temporally localizing activity-level forgeries in videos, paired with a diffusion-based baseline called TADiff.
super hub Mixed citations
Denoising Diffusion Implicit Models
Mixed citation behavior. Most common role is background (67%).
abstract
Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples $10 \times$ to $50 \times$ faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose revers
authors
co-cited works
representative citing papers
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
MUSE shows that the native timestep embedding in diffusion models acts as a parameter-free steering signal for multi-task monocular depth and normal estimation via manifold decoupling in latent space.
Introduces the ASTAD task and training-free ASTModel framework for semantically consistent asymmetric style transfer using labeled synthetic content and unlabeled real references.
SplatShot is a training-free method that inserts per-step 3DGS refitting and photometric feedback into diffusion denoising to enforce multi-view consistency for single-photo 3D face avatars.
DRDD decouples diffusion into independent noise and residual stages to preserve domain harmonization and enable unified data-efficient I2I translation.
CGPO integrates training-free critic guidance into diffusion denoising to produce high-Q actions as regression targets, yielding SOTA results on MuJoCo locomotion and successful Franka arm grasping.
Midpoint Generative Models define a midpoint divergence from flow matching symmetry and derive its variational form as a tractable objective for training competitive one-step generators.
Spectral Guidance learns singular functions via self-supervised objective to project guidance signals onto diffusion sampling trajectories, enabling stable control without retraining or backpropagation and improving CIFAR-10 accuracy by 37 points with 4x faster sampling.
ASAP generates over 10K synthetic anatomical preference pairs via targeted degradation of high-fidelity images and applies a localized margin-bounded DPO to reduce anatomical errors in text-to-image human generation, supported by the new HAP dataset and HAF-Bench.
DeltaCam models relative changes in camera intrinsics via Δ-parameterized neural adaptors in video diffusion models trained on synthetic data to enable controllable generation and real-world transfer.
Loki replaces RGB conditioning stacks with identity-orthogonal parametric face encodings rasterized for diffusion, achieving efficient cross-ID portrait animation without cross-ID training data.
JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
DFSAttn is a training-free framework for dynamic fine-grained sparse attention in video DiTs that achieves up to 2.1x speedup while preserving generation quality via Hilbert reordering, hierarchical scoring, and adaptive caching.
VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.
Linear-DPO replaces sigmoid utility with linear utility and adds EMA reference to improve preference alignment in diffusion and flow-matching text-to-image models.
DrawMotion is a diffusion-based framework that fuses text and hand-drawn stickman conditions via a Multi-Condition Module and training-free guidance to generate 3D human motions.
CAdam reinterprets densification in generative 3DGS as signal verification via gradient-moment interference, quantile context, and SNR gating to achieve large reductions in primitive count with comparable quality.
A hypernetwork generates complete task-specific visuomotor policy parameters from instructions alone to structurally eliminate observation leakage in language-conditioned robotic control.
BrepForge factorizes B-rep synthesis into face-aware autoregressive wireframe composition followed by boundary-conditioned surface instantiation using learning-free geometric priors.
IPR improves valid solution rates on MNIST Sudoku from 55.8% to 75.0% by iteratively refining partial regions in sequential diffusion models without external verifiers or reward models.
PolycubeNet applies a dual-latent diffusion architecture to generate polycube point clouds from input point clouds, enabling robust hexahedral mesh creation without surface segmentation or templates.
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
citing papers explorer
-
RDSplat: Robust Watermarking for 3D Gaussian Splatting Against 2D and 3D Diffusion Editing
RDSplat is the first 3D Gaussian Splatting watermarking method that maintains 0.701 bit accuracy against both 2D and 3D diffusion editing by embedding only in low-frequency primitives selected via FAPS.
-
Multimodal Diffusion Forcing for Forceful Manipulation
Multimodal Diffusion Forcing trains a diffusion model on partially masked multimodal robot trajectories to learn temporal and cross-modal dependencies for forceful manipulation.
-
Noise Aggregation Analysis Driven by Small-Noise Injection: Efficient Membership Inference for Diffusion Models
Introduces noise aggregation analysis with single-step small-noise injection to enable efficient and accurate membership inference attacks on diffusion models.
-
Exploring Cross-Modal Flows for Few-Shot Learning
FMA introduces flow matching for multi-step cross-modal feature alignment in few-shot learning, using fixed coupling, noise augmentation, and early-stopping to outperform one-step PEFT methods.
-
Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner
CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.
-
Diffusion and Flow-based Copulas: Forgetting and Remembering Dependencies
Diffusion and flow processes forget dependencies to define valid copulas then learn to remember them for density estimation and sampling, outperforming prior copula methods on complex datasets.
-
DiffusionNFT: Online Diffusion Reinforcement with Forward Process
DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-process methods.
-
FluentAvatar: Flicker-Free Talking-Head Animation via Phoneme-Guided Autoregressive Modeling
Phoneme-guided autoregressive framework for talking-head animation that reduces inter-frame flicker via causal keyframe generation and timestamp-aware interpolation, outperforming diffusion baselines on FVD and a new BG-Flicker metric.
-
Patient-Adaptive Echocardiography using Cognitive Ultrasound
A temporal diffusion model enables adaptive selection of focused ultrasound transmits, outperforming random subsampling and diverging waves on EchoNet-Dynamic and in-house echocardiogram datasets while supporting real-time operation.
-
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.
-
MAGIC: Few-Shot Mask-Guided Anomaly Inpainting with Prompt Perturbation, Spatially Adaptive Guidance, and Context Awareness
MAGIC is a few-shot mask-guided anomaly inpainting framework using Gaussian prompt perturbation, spatially adaptive guidance, and context-aware mask alignment to produce high-fidelity, diverse anomalies that outperform prior methods on downstream detection tasks.
-
Doloris: Dual Conditional Diffusion Implicit Bridges with Sparsity Masking Strategy for Unpaired Single-Cell Perturbation Estimation
Doloris introduces dual conditional diffusion implicit bridges plus a sparsity masking strategy to model unpaired single-cell perturbation responses and reports state-of-the-art results on public datasets.
-
GenHSI: Controllable Generation of Human-Scene Interaction Videos
GenHSI is a training-free three-stage pipeline that turns a scene image, character image, and complex HSI prompt into long videos with plausible chained interactions by generating atomic actions, 3D keyframes via 2D inpainting plus optimization, and then feeding them to pre-trained video diffusion.
-
Beyond Blur: A Fluid Perspective on Generative Diffusion Models
Proposes an advection-diffusion PDE corruption process with stochastic velocity fields and Lattice Boltzmann solver for diffusion models, generalizing prior PDE methods.
-
Steering Your Diffusion Policy with Latent Space Reinforcement Learning
DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.
-
FD-Bench: A Modular and Fair Benchmark for Data-driven Fluid Simulation
FD-Bench supplies the first modular, reproducible benchmark and leaderboard for comparing neural PDE solvers on fluid dynamics tasks with direct numerical solver baselines.
-
COCO-Inpaint: A Benchmark for Detecting and Localizing Inpainting-Based Image Manipulations
COCO-Inpaint supplies a large-scale dataset and evaluation protocol focused on inpainting-based image forgeries to benchmark existing detection methods.
-
UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models
UniEdit-Flow presents tuning-free Uni-Inv and Uni-Edit methods for inversion and editing in flow models that achieve accurate reconstruction and robust region-preserving edits across generative models.
-
Gungnir: Exploiting Stylistic Features in Images for Backdoor Attacks on Diffusion Models
Gungnir shows that style-based triggers with RAN and STTR techniques can activate backdoors in diffusion models while evading detection and surviving fine-tuning.
-
History-Guided Video Diffusion
DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.
-
Generating Attribution Reports for Manipulated Facial Images: A Dataset and Baseline
Introduces the MMTT dataset of 152k manipulated facial images with masks and text descriptions, plus the ForgeryTalker model that jointly outputs localization masks and explanatory text, reporting 59.3 CIDEr and 73.67 IoU.
-
Repurposing Image Diffusion Models for Training-Free Music Style Transfer on Mel-spectrograms
Stylus achieves training-free music style transfer on Mel-spectrograms by repurposing image diffusion models via style-key injection in self-attention plus phase-preserving reconstruction, outperforming baselines by 34.1% in content preservation and 25.7% in perceptual quality per 2,925 human raters
-
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
-
One Step Diffusion via Shortcut Models
Shortcut models enable high-quality single or few-step sampling in diffusion models with one network and training phase by conditioning on desired step size.
-
Diffusion Models Are Real-Time Game Engines
A diffusion model trained on DOOM play sessions generates stable real-time interactive game frames at 20 FPS with quality near lossy JPEG.
-
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
-
Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.
-
Learning Interactive Real-World Simulators
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
-
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Latent Consistency Models enable high-fidelity text-to-image generation in 2-4 steps by directly predicting solutions to the probability flow ODE in latent space, distilled from pre-trained LDMs.
-
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.
-
Scalable Diffusion Models with Transformers
DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.
-
Imagen Video: High Definition Video Generation with Diffusion Models
Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.
-
DreamFusion: Text-to-3D using 2D Diffusion
Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.
-
Human Motion Diffusion Model
MDM is a classifier-free diffusion model that generates expressive human motions by predicting clean samples rather than noise, supporting text and action conditioning and outperforming prior methods on standard benchmarks.
-
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning
Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.
-
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.
-
Hierarchical Text-Conditional Image Generation with CLIP Latents
A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.
-
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
A 3.5-billion-parameter diffusion model with classifier-free guidance generates images preferred over DALL-E by human raters and can be fine-tuned for text-guided inpainting.
-
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
SDEdit performs guided image synthesis and editing by adding noise to inputs and refining them via denoising with a diffusion model's SDE prior, outperforming GAN methods in human studies without task-specific training.
-
Diffusion Models Beat GANs on Image Synthesis
Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
-
LEGO: LoRA-Enabled Generator-Oriented Framework for Synthetic Image Detection
LEGO uses multiple generator-specific LoRA modules modulated by an MLP and fused with attention to detect synthetic images, achieving better performance than prior methods while using under 10% of the training data.
-
Cast3: Translating numerical weather prediction principles into data-driven forecasting
Cast3 translates NWP principles into a data-driven model using cubed-sphere grids, super-ensembles, and generative nudging to achieve state-of-the-art ensemble predictions that outperform baselines.
-
Generative diffusion models for spatiotemporal influenza forecasting
Influpaint uses generative diffusion models on image-encoded influenza data to produce realistic and diverse epidemic trajectories that match leading ensemble methods in accuracy.
-
3D Scene-Adaptive Trajectory-Controllable Human Image Animation with Camera Movement
Presents a scene-adaptive 3D human animation method using ground-adaptive motion retargeting and viewpoint-adaptive latent fusion to control human trajectories and camera views, reporting gains on two benchmarks.
-
EMOSH: Expressive Motion and Shape Disentanglement for Human Animation
EMOSH proposes an Expressive Human Model with disentangled parameters, coarse-to-fine motion injection, and spatially-aligned conditioning to generate high-fidelity expressive human videos without driving-subject shape leakage.
-
Latent Visual Diffusion Reasoning with Monte Carlo Tree Search
LVDR integrates keypoint-guided MCTS into a latent diffusion reasoning model to deliver competitive skill assessment accuracy alongside explicit visual reasoning trajectories on four sports and surgical datasets.
-
RS-Diffuser: Risk-Sensitive Diffusion Planning with Distributional Value Guidance
RS-Diffuser integrates diffusion planners, quantile regression critics, and CVaR-style guidance to produce risk-averse to risk-seeking behaviors from one model in offline RL.
-
PixelU: A U-Shaped Transformer for Efficient End-to-End Pixel Diffusion
PixelU is a minimalist U-shaped Diffusion Transformer for pixel-space diffusion that decouples frequencies with zero-cost skip connections and constant-channel downsampling, outperforming baselines like JiT-G at 1/3 the compute cost with FID 1.63 on ImageNet 256x256.
-
GeoFace: Consistent Multi-View Face Generation with Geometry-Constrained Diffusion
GeoFace generates consistent multi-view face images and 3D geometry from one input via a dual-stream diffusion framework with geometry-guided attention alignment.
-
Adversarial Diffusion Across Modalities: A Fusion Survey of Attacks, Defenses, and Evaluation for Text, Vision, and Vision-Language Models
A narrative survey that catalogs fifty papers on diffusion-based adversarial techniques across text, vision, and vision-language models, proposes a six-class taxonomy of diffusion roles plus a unified five-dimension evaluation framework, and releases a companion catalog.