Stochastic Adversarial Video Prediction
11 Pith papers cite this work.
abstract
Being able to predict what may happen in the future requires an in-depth understanding of the physical and causal rules that govern the world. A model that is able to do so has a number of appealing applications, from robotic planning to representation learning. However, learning to predict raw future observations, such as frames in a video, is exceedingly challenging -- the ambiguous nature of the problem can cause a naively designed model to average together possible futures into a single, blurry prediction. Recently, this has been addressed by two distinct approaches: (a) latent variational variable models that explicitly model underlying stochasticity and (b) adversarially-trained models that aim to produce naturalistic images. However, a standard latent variable model can struggle to produce realistic results, and a standard adversarially-trained model underutilizes latent variables and fails to produce diverse predictions. We show that these distinct methods are in fact complementary. Combining the two produces predictions that look more realistic to human raters and better cover the range of possible futures. Our method outperforms prior and concurrent work in these aspects.
roles: background 3
citing papers explorer
- Learning Interactive Real-World Simulators
  UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
- Video Diffusion Models
  A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering first large-scale text-to-video results and state-of-the-art performance on video prediction and unconditional generation benchmarks.
- PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning
  PFGNet introduces a frequency-guided peripheral gating block in a pure convolutional architecture to enable adaptive receptive fields for efficient spatiotemporal prediction with fewer parameters than prior methods.
- Video Generators are Robot Policies
  Training models to generate videos of robot actions produces policies that generalize better to new objects and tasks while using far less demonstration data than standard behavior cloning.
- Training Language Models to Self-Correct via Reinforcement Learning
  SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
- Is Conditional Generative Modeling all you need for Decision-Making?
  Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.
- RoboNet: Large-Scale Multi-Robot Learning
  RoboNet is a multi-robot video dataset that enables pre-training of vision-based manipulation models which, after fine-tuning on a new robot, outperform robot-specific training that uses 4-20 times more data.
- Order Matters: Shuffling Sequence Generation for Video Prediction
  SEE-Net improves video prediction by using frame shuffling to enforce learning of natural temporal order, reporting state-of-the-art results on three synthetic and real-world datasets.
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
  Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results while enabling motion LoRA and multi-view 3D applications.
- VideoGPT: Video Generation using VQ-VAE and Transformers
  VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.
- Multi-Modal World Model for Physical Robot Interactions: Simultaneous Visual and Tactile Predictions for Enhanced Accuracy
  Visuo-tactile world models improve prediction accuracy in physically ambiguous robot-pushing scenarios, demonstrated on two new datasets with a magnetic tactile sensor.