ActivityForensics is the first large-scale benchmark for temporally localizing activity-level forgeries in videos, paired with a diffusion-based baseline called TADiff.
super hub Mixed citations
Denoising Diffusion Implicit Models
Mixed citation behavior. Most common role is background (67%).
abstract
Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples $10 \times$ to $50 \times$ faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose revers
authors
co-cited works
representative citing papers
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
SplatShot is a training-free method that inserts per-step 3DGS refitting and photometric feedback into diffusion denoising to enforce multi-view consistency for single-photo 3D face avatars.
DRDD decouples diffusion into independent noise and residual stages to preserve domain harmonization and enable unified data-efficient I2I translation.
CGPO integrates training-free critic guidance into diffusion denoising to produce high-Q actions as regression targets, yielding SOTA results on MuJoCo locomotion and successful Franka arm grasping.
Midpoint Generative Models define a midpoint divergence from flow matching symmetry and derive its variational form as a tractable objective for training competitive one-step generators.
JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
DFSAttn is a training-free framework for dynamic fine-grained sparse attention in video DiTs that achieves up to 2.1x speedup while preserving generation quality via Hilbert reordering, hierarchical scoring, and adaptive caching.
VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.
Linear-DPO replaces sigmoid utility with linear utility and adds EMA reference to improve preference alignment in diffusion and flow-matching text-to-image models.
DrawMotion is a diffusion-based framework that fuses text and hand-drawn stickman conditions via a Multi-Condition Module and training-free guidance to generate 3D human motions.
CAdam reinterprets densification in generative 3DGS as signal verification via gradient-moment interference, quantile context, and SNR gating to achieve large reductions in primitive count with comparable quality.
A hypernetwork generates complete task-specific visuomotor policy parameters from instructions alone to structurally eliminate observation leakage in language-conditioned robotic control.
BrepForge factorizes B-rep synthesis into face-aware autoregressive wireframe composition followed by boundary-conditioned surface instantiation using learning-free geometric priors.
IPR improves valid solution rates on MNIST Sudoku from 55.8% to 75.0% by iteratively refining partial regions in sequential diffusion models without external verifiers or reward models.
PolycubeNet applies a dual-latent diffusion architecture to generate polycube point clouds from input point clouds, enabling robust hexahedral mesh creation without surface segmentation or templates.
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
DiffIML applies score-based generative modeling to image manipulation localization, recovering coherent masks iteratively from noise to improve generalization on unseen manipulation types.
VMU-Diff improves precipitation nowcasting via coarse multi-source Vision Mamba fusion followed by residual conditional diffusion refinement.
HASTE delivers up to 1.93x speedup on Wan2.1 video DiTs via head-wise adaptive sparse attention using temporal mask reuse and error-guided per-head calibration while preserving video quality.
Introduces the task of counterfactual time series forecasting with textual conditions plus a text-attribution mechanism that improves accuracy by distinguishing mutable from immutable factors.
MM-SOLD is a training-free particle sampler whose large-particle limit converges to a moment-matched Gibbs distribution obtained by exponentially tilting a score-smoothed target.
citing papers explorer
-
Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance
CGPO integrates training-free critic guidance into diffusion denoising to produce high-Q actions as regression targets, yielding SOTA results on MuJoCo locomotion and successful Franka arm grasping.
-
Point Tracking Improves World Action Models
JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
-
DISC: Decoupling Instruction from State-Conditioned Control via Policy Generation
A hypernetwork generates complete task-specific visuomotor policy parameters from instructions alone to structurally eliminate observation leakage in language-conditioned robotic control.
-
Muninn: Your Trajectory Diffusion Model But Faster
Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.
-
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
-
Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation
ReV is a referring-aware visuomotor policy using coupled diffusion heads for real-time trajectory replanning in robotic manipulation, trained solely via targeted perturbations to expert demonstrations and achieving higher success rates in simulated and real tasks.
-
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
-
Action-to-Action Flow Matching
A2A flow matching starts action generation from prior proprioceptive actions in latent space to enable single-step high-quality predictions in robotic policies.
-
Information Filtering via Variational Regularization for Robot Manipulation
Variational Regularization imposes an adaptive information bottleneck on noisy intermediate features in DP3-UNet and DP3-DiT policies, consistently raising task success rates on RoboTwin2.0, Adroit, and MetaWorld while achieving new state-of-the-art results.
-
When Backdoors Meet Partial Observability: Attacking Real-World Reinforcement Learning
DGBA enables reliable backdoor attacks on real-world RL policies under partial observability by learning stochastic visual triggers via conditional diffusion and using advantage-based poisoning at critical states.
-
Multimodal Diffusion Forcing for Forceful Manipulation
Multimodal Diffusion Forcing trains a diffusion model on partially masked multimodal robot trajectories to learn temporal and cross-modal dependencies for forceful manipulation.
-
Steering Your Diffusion Policy with Latent Space Reinforcement Learning
DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.
-
Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.
-
HITL-D: Human In The Loop Diffusion Assisted Shared Control
HITL-D combines diffusion policies with human input for shared robotic control, reducing required joystick axes and improving speed and workload in manipulation tasks per a 12-participant study.
-
Beyond Imitation: Learning Safe End-to-End Autonomous Driving from Hard Negatives
BeyondDrive augments imitation learning with synthesized safety-critical negative trajectories and a repulsive loss to improve safety in autonomous driving, reporting 89.7 PDMS on NAVSIMv1 and generalization to other models.
-
Guiding Neuro-Symbolic Scenario Generation with Spatio-Temporal Logic
STRELGen combines a multi-agent diffusion model with differentiable STREL specifications to optimize latent space for generating plausible yet safety-critical driving scenarios.
-
FLASH: Efficient Visuomotor Policy via Sparse Sampling
FLASH Policy uses sparse Legendre polynomial trajectory fitting and history-anchored flow matching to enable single-step inference for visuomotor control, reporting 31.4 ms per-episode latency and >=92% success on five simulated plus two real manipulation tasks.
-
GSDrive: Reinforcing Driving Policies by Multi-mode Future Trajectory Probing with 3D Gaussian Splatting Environment
GSDrive combines IL priors with RL feedback by probing multi-mode futures inside a 3D Gaussian Splatting simulator to supply dense rewards for closed-loop driving policy improvement on nuScenes.
-
DAG-STL: A Hierarchical Framework for Zero-Shot Trajectory Planning under Signal Temporal Logic Specifications
DAG-STL decomposes long-horizon STL planning into decomposition, timed waypoint allocation, and diffusion-based trajectory generation to enable zero-shot planning under unknown dynamics.
-
FeaXDrive: Feasibility-aware Trajectory-Centric Diffusion Planning for End-to-End Autonomous Driving
FeaXDrive improves end-to-end autonomous driving by shifting diffusion planning to a trajectory-centric formulation with curvature-constrained training, drivable-area guidance, and GRPO post-training, yielding stronger closed-loop performance and feasibility on NAVSIM.
-
ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors
ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.
-
HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations
HoMMI learns whole-body mobile manipulation policies from robot-free human demonstrations by augmenting UMI with egocentric sensing and bridging the embodiment gap through an agnostic visual representation, relaxed head actions, and a whole-body controller.
-
One Hand to Rule Them All: Canonical Representations for Unified Dexterous Manipulation
A unified parameter space and canonical URDF enable cross-embodiment dexterous grasping policies with 81.9% zero-shot success on unseen hands like the 3-finger LEAP Hand.
-
Flexible Multitask Learning with Factorized Diffusion Policy
A factorized modular diffusion policy improves fitting of multimodal robot actions and enables flexible task adaptation without catastrophic forgetting.
-
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
MemoryVLA introduces a perceptual-cognitive memory bank and working-memory retrieval mechanism into VLA models, raising success rates on long-horizon robotic tasks by up to 26 points over prior baselines.
-
DexWrist: A Robotic Wrist for Constrained and Dynamic Manipulation
DexWrist presents a 0.97 kg robotic wrist with 3.75 Nm torque, 0.33 Nm backdrive torque, and 10 Hz bandwidth that improves success rates by 50-76% on constrained manipulation tasks.
-
ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation
ViTacFormer learns a cross-modal visuo-tactile latent space with autoregressive tactile prediction and an easy-to-hard curriculum, then uses the representation for imitation learning that yields ~50% higher success and the first reported 11-stage, 2.5-minute autonomous dexterous tasks.
-
GAF: Gaussian Action Field as a 4D Representation for Dynamic World Modeling in Robotic Manipulation
GAF creates 4D dynamic scene models by adding motion to 3D Gaussians, enabling better reconstruction and 7.3% higher success in robotic tasks.
-
Using Ensemble Diffusion to Estimate Uncertainty for End-to-End Autonomous Driving
EnDfuser replaces point-estimate trajectory planning with ensemble diffusion in a single attention-pooling transformer module to model posterior trajectory uncertainty and improve safety in end-to-end autonomous driving.
-
DreamPolicy: A Unified World-model Policy for Scalable Humanoid Locomotion
DreamPolicy integrates an autoregressive diffusion world model with policy learning to produce a single scalable policy that generalizes to unseen composite terrains for humanoid locomotion.
-
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.
-
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new robots and objects.
-
Diffusion Policy Policy Optimization
DPPO fine-tunes diffusion policies via policy gradients and outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.
-
3D Diffuser Actor: Policy Diffusion with 3D Scene Representations
3D Diffuser Actor unifies diffusion policies with 3D scene features to set new state-of-the-art results on RLBench and CALVIN robot benchmarks.
-
Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
A low-cost whole-body teleoperation system enables effective imitation learning for complex bimanual mobile manipulation by co-training on mobile and static demonstration datasets.
-
Scaling Robot Learning with Semantically Imagined Experience
Augmenting robot datasets via diffusion-based semantic inpainting enables manipulation policies to solve unseen tasks with new objects and improves robustness to novel distractors.
-
DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization
DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.
-
Learning Tactile-Aware Quadrupedal Loco-Manipulation Policies
A hierarchical tactile-aware policy combines human-demonstration training for contact cue prediction with sim-to-real reinforcement learning to improve quadrupedal loco-manipulation performance by 28.54% over vision baselines on contact-rich tasks.
-
Gated Memory Policy
GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.
-
Diffusion-Based Optimization for Accelerated Convergence of Redundant Dual-Arm Minimum Time Problems
A novel diffusion variant accelerates minimum-time planning for redundant dual-arm robots by replacing gradient-based solving of the nonconvex high-level problem with probabilistic sampling, yielding 35x faster runtime and 34% less path error.
-
Sparse ActionGen: Accelerating Diffusion Policy with Real-time Pruning
Sparse ActionGen accelerates diffusion policies up to 4x for robot control via rollout-adaptive pruning and zig-zag activation reuse without performance loss.
-
Contact-Rich Robotic Assembly in Construction via Diffusion Policy Learning
Diffusion policies achieve 100% success on nominal mortise-tenon timber assembly and 75% average success under randomized 10 mm perturbations using force/torque sensing on an industrial robot.
-
GRITS: A Spillage-Aware Guided Diffusion Policy for Robot Food Scooping Tasks
GRITS combines guided diffusion with a sim-trained spillage predictor to reach 82% success and 4% spillage on ten unseen real food categories, cutting spillage over 40% versus unguided baselines.
-
A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation
Multi-task pretraining of diffusion policies on diverse robot data produces more successful, robust, and data-efficient policies for dexterous manipulation than single-task baselines, with performance scaling with pretraining size and diversity.
-
DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment
DriVerse is a generative model that simulates driving scenes from an image and trajectory using multimodal prompting and motion alignment, achieving better performance on nuScenes and Waymo datasets with minimal training.
-
EL3DD: Extended Latent 3D Diffusion for Language Conditioned Multitask Manipulation
EL3DD extends latent 3D diffusion with language inputs and reference demonstrations to improve success rates on sequential manipulation tasks in the CALVIN dataset.
- Hyper-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control
- You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector