Transformers converge globally to the optimal DDPM denoiser for multi-token GMMs via self-attention mean denoising, with explicit token and iteration requirements.
High-resolution image synthesis with latent diffusion models
6 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 6representative citing papers
DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.
RoboDreamer factorizes video generation using language primitives to achieve compositional generalization in robot world models, outperforming monolithic baselines on unseen goals in RT-X.
3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
TripoSG generates high-fidelity 3D meshes from input images via a large-scale rectified flow transformer and hybrid-trained 3D VAE on a custom 2-million-sample dataset, claiming state-of-the-art fidelity and generalization.
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
citing papers explorer
-
Transformers Learn the Optimal DDPM Denoiser for Multi-Token GMMs
Transformers converge globally to the optimal DDPM denoiser for multi-token GMMs via self-attention mean denoising, with explicit token and iteration requirements.
-
History-Guided Video Diffusion
DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.
-
RoboDreamer: Learning Compositional World Models for Robot Imagination
RoboDreamer factorizes video generation using language primitives to achieve compositional generalization in robot world models, outperforming monolithic baselines on unseen goals in RT-X.
-
3D-VLA: A 3D Vision-Language-Action Generative World Model
3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
-
TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models
TripoSG generates high-fidelity 3D meshes from input images via a large-scale rectified flow transformer and hybrid-trained 3D VAE on a custom 2-million-sample dataset, claiming state-of-the-art fidelity and generalization.
-
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.