UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
Laion-5b: An open large-scale dataset for training next generation image-text models
5 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
SalUn uses gradient-based weight saliency to achieve effective machine unlearning of data, classes, or concepts in image classification and generation, narrowing the gap to exact retraining.
mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
citing papers explorer
-
Learning Interactive Real-World Simulators
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
-
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
-
SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation
SalUn uses gradient-based weight saliency to achieve effective machine unlearning of data, classes, or concepts in image classification and generation, narrowing the gap to exact retraining.
-
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.
-
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.