Structured world models from human videos
14 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- RotVLA: Rotational Latent Action for Vision-Language-Action Model
RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
- Latent State Design for World Models under Sufficiency Constraints
World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
- MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction
Introduces a new task of goal-conditioned 3D point motion forecasting along with a 1.16M-video dataset, a 111-category benchmark, and a model that outperforms baselines while transferring to robotics and video generation.
- Grasp as You Dream: Imitating Functional Grasping from Generated Human Demonstrations
GraspDreamer synthesizes human functional grasping demonstrations with visual generative models to enable zero-shot robot grasping with improved data efficiency and generalization.
- Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations
RIGVid shows that filtered AI-generated videos can serve as effective supervision for complex robotic manipulation tasks without any real demonstrations.
- GAF: Gaussian Action Field as a 4D Representation for Dynamic World Modeling in Robotic Manipulation
GAF creates 4D dynamic scene models by adding motion to 3D Gaussians, enabling better reconstruction and 7.3% higher success in robotic tasks.
- DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning
DINO-WM builds world models on pre-trained DINOv2 features to enable zero-shot planning from offline data without rewards or demonstrations.
- Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior baselines.
- GR-3 Technical Report
GR-3 is a VLA model that generalizes to novel objects, environments, and abstract instructions, outperforms the π0 baseline, and integrates with the new ByteMini bi-manual mobile robot.
- Robot Self-Improvement via Human-Video Dynamics Models
Human-video dynamics models enable cross-embodiment robot self-improvement via training-free Dynamics-Guided Action Correction, raising success rates from 40% to 81% on seven real-world tasks.
- World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
- World Action Models: A Survey
A survey that clarifies boundaries and organizes World Action Models by generation requirements and predictive substrates, identifying a trend toward generating less of the future.
- From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
- Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints