ActivityForensics is the first large-scale benchmark for temporally localizing activity-level forgeries in videos, paired with a diffusion-based baseline called TADiff.
Denoising Diffusion Implicit Models
Mixed citation behavior. Most common role is background (67%).
abstract
Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples $10 \times$ to $50 \times$ faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.
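The sampling speed-up described in the abstract comes from running the reverse process on a short subsequence of the training timesteps. The following is a minimal sketch of that idea, not the authors' implementation. It uses the generalized DDIM update with a noise scale `eta` (`eta = 0` gives the deterministic sampler). The function names, the linear beta schedule, and the toy "oracle" noise predictor that stands in for a trained network are all assumptions made for illustration.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    # Assumed linear beta schedule; alpha_bar[t] is the cumulative product of (1 - beta).
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def ddim_step(x_t, eps_pred, a_t, a_prev, eta=0.0, rng=None):
    # Predict the clean sample from the current noisy sample and the predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps_pred) / np.sqrt(a_t)
    # eta = 0 gives the deterministic update; eta = 1 gives DDPM-like noise.
    sigma = eta * np.sqrt((1.0 - a_prev) / (1.0 - a_t)) * np.sqrt(1.0 - a_t / a_prev)
    direction = np.sqrt(np.maximum(1.0 - a_prev - sigma**2, 0.0)) * eps_pred
    noise = 0.0 if sigma == 0 else sigma * rng.standard_normal(x_t.shape)
    return np.sqrt(a_prev) * x0_pred + direction + noise

def ddim_sample(eps_model, shape, alpha_bar, num_sample_steps=20, seed=0):
    rng = np.random.default_rng(seed)
    # Visit only a subsequence of the training timesteps, from most to least noisy.
    timesteps = np.linspace(0, len(alpha_bar) - 1, num_sample_steps).round().astype(int)[::-1]
    x = rng.standard_normal(shape)
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        # After the last visited timestep, step to the noise-free level (alpha_bar = 1).
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        x = ddim_step(x, eps_model(x, t), a_t, a_prev, eta=0.0, rng=rng)
    return x

# Toy check: the data distribution is a single point, so the exact noise is known in closed form.
alpha_bar = make_alpha_bar()
x0_true = np.array([1.5, -0.5])

def oracle_eps(x_t, t):
    # Hypothetical stand-in for a trained noise-prediction network.
    return (x_t - np.sqrt(alpha_bar[t]) * x0_true) / np.sqrt(1.0 - alpha_bar[t])

sample = ddim_sample(oracle_eps, x0_true.shape, alpha_bar, num_sample_steps=20)
```

In this toy setting the sampler recovers `x0_true` using 20 network evaluations instead of 1000. With a real trained model, the number of steps trades computation against sample quality, as the abstract notes.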
representative citing papers
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
Introduces a Bridge latent interface that maps mismatched student latents into teacher space, enabling distillation from modern diffusion teachers to compact one-step students and raising SD 1.5 HPSv3 from 5.4 to 9.4 while keeping one-step speed.
LA-SR redefines unpaired super-resolution in language space by projecting images into a semantically rich representation and applying vision-language model guided losses to handle real-world degradations extracted from depth variations.
MUSE shows that the native timestep embedding in diffusion models acts as a parameter-free steering signal for multi-task monocular depth and normal estimation via manifold decoupling in latent space.
Introduces the ASTAD task and training-free ASTModel framework for semantically consistent asymmetric style transfer using labeled synthetic content and unlabeled real references.
SDS extracts stable spectral signatures from diffusion model denoisers via frequency-controlled perturbations, achieving 99.9% attribution accuracy across eight models and 96.2% under prompt shift.
SplatShot is a training-free method that inserts per-step 3DGS refitting and photometric feedback into diffusion denoising to enforce multi-view consistency for single-photo 3D face avatars.
DRDD decouples diffusion into independent noise and residual stages to preserve domain harmonization and enable unified data-efficient I2I translation.
CGPO integrates training-free critic guidance into diffusion denoising to produce high-Q actions as regression targets, yielding SOTA results on MuJoCo locomotion and successful Franka arm grasping.
Midpoint Generative Models define a midpoint divergence from flow matching symmetry and derive its variational form as a tractable objective for training competitive one-step generators.
Spectral Guidance learns singular functions via a self-supervised objective and uses them to project guidance signals onto diffusion sampling trajectories, enabling stable control without retraining or backpropagation and improving CIFAR-10 accuracy by 37 points with 4x faster sampling.
ASAP generates over 10K synthetic anatomical preference pairs via targeted degradation of high-fidelity images and applies a localized margin-bounded DPO to reduce anatomical errors in text-to-image human generation, supported by the new HAP dataset and HAF-Bench.
DeltaCam models relative changes in camera intrinsics via Δ-parameterized neural adaptors in video diffusion models trained on synthetic data to enable controllable generation and real-world transfer.
Loki replaces RGB conditioning stacks with identity-orthogonal parametric face encodings rasterized for diffusion, achieving efficient cross-ID portrait animation without cross-ID training data.
JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
DFSAttn is a training-free framework for dynamic fine-grained sparse attention in video DiTs that achieves up to 2.1x speedup while preserving generation quality via Hilbert reordering, hierarchical scoring, and adaptive caching.
VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.
Linear-DPO replaces sigmoid utility with linear utility and adds EMA reference to improve preference alignment in diffusion and flow-matching text-to-image models.
DrawMotion is a diffusion-based framework that fuses text and hand-drawn stickman conditions via a Multi-Condition Module and training-free guidance to generate 3D human motions.
CAdam reinterprets densification in generative 3DGS as signal verification via gradient-moment interference, quantile context, and SNR gating to achieve large reductions in primitive count with comparable quality.
A hypernetwork generates complete task-specific visuomotor policy parameters from instructions alone to structurally eliminate observation leakage in language-conditioned robotic control.
FlowErase-RL applies GRPO to reformulate concept erasure in flow matching models as reward optimization using a dynamic dual-path mechanism for target suppression and non-target preservation.
citing papers explorer
-
Emu3.5: Native Multimodal Models are World Learners
Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.
-
RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.
-
Control-Augmented Autoregressive Diffusion for Data Assimilation
An offline-trained controller augments autoregressive diffusion models to perform fast, feed-forward data assimilation in chaotic spatiotemporal PDEs with order-of-magnitude speedups and improved accuracy over baselines.
-
Locate-Then-Examine: Grounded Region Reasoning Improves Detection of AI-Generated Images
Locate-Then-Examine improves AI-generated image detection by localizing suspicious regions first then performing region-aware re-examination, while releasing the TRACE dataset of 20k annotated images.
-
Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
Rolling Forcing generates multi-minute videos in real time by jointly denoising frames at increasing noise levels, anchoring attention to early frames, and using windowed distillation to limit error accumulation.
-
Sample-Efficient Optimisation over the Outputs of Generative Models
O3 uses surrogate latent spaces extracted from generative models to perform sample-efficient black-box optimization over their outputs, outperforming direct sampling and original-latent optimization on image and protein tasks.
-
Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling
Dynamic-TreeRPO replaces independent trajectory sampling with a tree-structured search using dynamic noise intensities and integrates SFT into RL via a weighted Progress Reward Model to achieve better semantic consistency and efficiency in text-to-image generation.
-
FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing
FlashEdit delivers real-time localized text-guided image editing under 0.2 seconds via cycle-consistent one-step inversion, background shield, and sparsified spatial cross-attention, achieving over 150x speedup on PIE-Bench.
-
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
MemoryVLA introduces a perceptual-cognitive memory bank and working-memory retrieval mechanism into VLA models, raising success rates on long-horizon robotic tasks by up to 26 points over prior baselines.
-
HERO: Hierarchical Extrapolation and Refresh for Efficient World Models
HERO accelerates world model inference 1.73x via hierarchical patch-wise refresh in shallow layers and linear extrapolation in deeper layers with minimal quality loss.
-
IntrinsicWeather: Controllable Weather Editing in Intrinsic Space
A diffusion framework decomposes images into intrinsic maps via an inverse renderer and renders controllable weather changes via a forward renderer with CLIP prompt interpolation and map-aware attention, outperforming pixel-space baselines on new 38k synthetic and 18k real datasets.
-
Synthetic Data Augmentation for Enhanced Chicken Carcass Instance Segmentation
Synthetic data augmentation improves instance segmentation performance for chicken carcasses when real annotated data is limited.
-
SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models
SmokeSVD reconstructs dynamic smoke from a single video via diffusion-based side-view synthesis, progressive multi-view refinement, and Navier-Stokes-guided density-velocity estimation.
-
Stein Diffusion Guidance: Training-Free Posterior Correction for Sampling Beyond High-Density Regions
Stein Diffusion Guidance corrects approximate posteriors in diffusion sampling via a Stein variational mechanism and surrogate SOC objective to enable effective guidance beyond high-density regimes.
-
DexWrist: A Robotic Wrist for Constrained and Dynamic Manipulation
DexWrist presents a 0.97 kg robotic wrist with 3.75 Nm torque, 0.33 Nm backdrive torque, and 10 Hz bandwidth that improves success rates by 50-76% on constrained manipulation tasks.
-
ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation
ViTacFormer learns a cross-modal visuo-tactile latent space with autoregressive tactile prediction and an easy-to-hard curriculum, then uses the representation for imitation learning that yields ~50% higher success and the first reported 11-stage, 2.5-minute autonomous dexterous tasks.
-
GAF: Gaussian Action Field as a 4D Representation for Dynamic World Modeling in Robotic Manipulation
GAF creates 4D dynamic scene models by adding motion to 3D Gaussians, enabling better reconstruction and 7.3% higher success in robotic tasks.
-
ReSim: Reliable World Simulation for Autonomous Driving
ReSim is a controllable video world model trained on heterogeneous real and simulated driving data that achieves higher fidelity and controllability for both expert and non-expert actions, plus a Video2Reward module for estimating action quality from simulated futures.
-
2ndMatch: Finetuning Pruned Diffusion Models via Second-Order Jacobian Matching
2ndMatch finetunes pruned diffusion models via second-order Jacobian matching inspired by Finite-Time Lyapunov Exponents to reduce the quality gap with dense models on image generation tasks.
-
Latent Stochastic Interpolants
Latent Stochastic Interpolants jointly optimize encoder-decoder and a latent-space stochastic interpolant using a continuous-time ELBO to transform arbitrary priors into aggregated posteriors.
-
Using Ensemble Diffusion to Estimate Uncertainty for End-to-End Autonomous Driving
EnDfuser replaces point-estimate trajectory planning with ensemble diffusion in a single attention-pooling transformer module to model posterior trajectory uncertainty and improve safety in end-to-end autonomous driving.
-
Fast Kernel-Space Diffusion for Remote Sensing Pansharpening
KSDiff generates convolutional kernels in kernel space using low-rank core tensor and factor generators with multi-head attention for fast, high-quality pansharpening.
-
DreamPolicy: A Unified World-model Policy for Scalable Humanoid Locomotion
DreamPolicy integrates an autoregressive diffusion world model with policy learning to produce a single scalable policy that generalizes to unseen composite terrains for humanoid locomotion.
-
Flow-based Generative Modeling of Potential Outcomes and Counterfactuals
PO-Flow uses continuous normalizing flows trained via flow matching to jointly model potential outcome distributions and enable factual-conditioned counterfactual prediction for causal inference tasks including CATE estimation.
-
DanceGRPO: Unleashing GRPO on Visual Generation
DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.
-
Sampling-Aware Quantization for Diffusion Models
A quantization technique for diffusion models that aligns sampling trajectories to preserve high-order sampler performance under quantization noise.
-
SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification
SD-ReID trains a ViT to extract identity and view conditions, fine-tunes Stable Diffusion to generate view-mimicking features, adds a View-Refined Decoder, and combines both identity and all-view features for retrieval on aerial-ground re-identification benchmarks.
-
SEAL: Semantic Aware Image Watermarking
SEAL uses semantic embeddings and locality-sensitive hashing to create distortion-free, database-free watermarks for generative images that are conditioned on content for improved forgery resistance.
-
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.
-
NullFace: Training-Free Localized Face Anonymization
NullFace performs training-free localized face anonymization by inverting images to noise and denoising with modified identity embeddings from a pre-trained diffusion model.
-
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.
-
Retrievals Can Be Detrimental: Unveiling the Backdoor Vulnerability of Retrieval-Augmented Diffusion Models
BadRDM is a backdoor attack on retrieval-augmented diffusion models that poisons the retrieval database with toxicity surrogates and uses multimodal contrastive learning to force toxic generations from text triggers while preserving benign performance.
-
OmniPrism: Learning Disentangled Visual Concept for Image Generation
OmniPrism proposes a disentanglement method using a new paired dataset (PCD-200K), COD contrastive training, and block embeddings to inject separated concepts into diffusion models for multi-aspect image generation.
-
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new robots and objects.
-
DGSNA: Dynamic Generative Scene-based Noise Addition method
DGSNA dynamically generates scene-specific noise via prompt-driven language models and text-to-audio diffusion, then mixes it with speech to improve recognition and keyword spotting robustness by up to 11.32%.
-
Conjuring Semantic Similarity
Semantic similarity between texts is measured by the Jeffreys divergence between the image distributions induced by conditioning a diffusion model on each text, computed via Monte-Carlo sampling of the reverse-time SDEs.
-
Diffusion Policy Policy Optimization
DPPO fine-tunes diffusion policies via policy gradients and outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.
-
VideoPhy: Evaluating Physical Commonsense for Video Generation
The VideoPhy benchmark shows that even the best state-of-the-art text-to-video model follows both the text prompt and physical commonsense in only 39.6% of cases.
-
PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference
PipeFusion applies patch partitioning and pipeline parallelism with one-step stale feature reuse to reduce communication overhead in DiT inference, reporting SOTA results on 8x L40 GPUs for Pixart, SD3, and Flux.1.
-
CAT3D: Create Anything in 3D with Multi-View Diffusion Models
A multi-view diffusion model generates consistent novel views from sparse images to enable fast 3D scene reconstruction.
-
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
-
3D Diffuser Actor: Policy Diffusion with 3D Scene Representations
3D Diffuser Actor unifies diffusion policies with 3D scene features to set new state-of-the-art results on RLBench and CALVIN robot benchmarks.
-
Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
A low-cost whole-body teleoperation system enables effective imitation learning for complex bimanual mobile manipulation by co-training on mobile and static demonstration datasets.
-
Diff-PCR: Diffusion-Based Correspondence Searching in Doubly Stochastic Matrix Space for Point Cloud Registration
Diff-PCR uses a diffusion model to learn denoising directions for refining doubly stochastic correspondence matrices, improving point cloud registration over one-shot normalization methods.
-
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.
-
Improved Techniques for Training Consistency Models
Improved consistency training techniques achieve FID scores of 2.51 on CIFAR-10 and 3.25 on ImageNet 64x64 in one sampling step, outperforming prior consistency training and distillation methods.
-
SyncDreamer: Generating Multiview-consistent Images from a Single-view Image
SyncDreamer produces multiview-consistent images from a single input image by jointly modeling their distribution and synchronizing intermediate diffusion states via 3D-aware attention.
-
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.
-
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-the-art generators.
-
Generative diffusion learning for parametric partial differential equations
A conditional DDPM framework is introduced to approximate solution operators for parameter-dependent PDEs, achieving accuracy comparable to FNO while recovering noise levels and providing confidence intervals.