hub Canonical reference

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li · 2024 · cs.RO · arXiv 2410.06158

Canonical reference. 80% of citing Pith papers cite this work as background.

71 Pith papers citing it

Background 80% of classified citations

open full Pith review browse 71 citing papers arXiv PDF

abstract

We present GR-2, a state-of-the-art generalist robot agent for versatile and generalizable robot manipulation. GR-2 is first pre-trained on a vast number of Internet videos to capture the dynamics of the world. This large-scale pre-training, involving 38 million video clips and over 50 billion tokens, equips GR-2 with the ability to generalize across a wide range of robotic tasks and environments during subsequent policy learning. Following this, GR-2 is fine-tuned for both video generation and action prediction using robot trajectories. It exhibits impressive multi-task learning capabilities, achieving an average success rate of 97.7% across more than 100 tasks. Moreover, GR-2 demonstrates exceptional generalization to new, previously unseen scenarios, including novel backgrounds, environments, objects, and tasks. Notably, GR-2 scales effectively with model size, underscoring its potential for continued growth and application. Project page: \url{https://gr2-manipulation.github.io}.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 30 method 4 baseline 1

citation-polarity summary

background 28 use method 3 baseline 2 unclear 2

claims ledger

abstract We present GR-2, a state-of-the-art generalist robot agent for versatile and generalizable robot manipulation. GR-2 is first pre-trained on a vast number of Internet videos to capture the dynamics of the world. This large-scale pre-training, involving 38 million video clips and over 50 billion tokens, equips GR-2 with the ability to generalize across a wide range of robotic tasks and environments during subsequent policy learning. Following this, GR-2 is fine-tuned for both video generation and action prediction using robot trajectories. It exhibits impressive multi-task learning capabilities,
background Vidar [77], Veo-Act [78], pi0.7 [ 79], V AG [80] Implicit VPP [11], VILP [ 81], Video Policy [13], ARDuP [ 82], mimic-video [ 12], LAP A [15], villa-X [ 83], S-V AM [14], OmniVTA [84], MWM [85] Joint W AM Autoregression GR1 [86], grmg [ 87], GR2 [88], Co TVLA [89], WorldVLA [90], rynnvla2 [91] VLA-JEP A [92], F1-VLA [93] Diffusion-based P AD [21], VideoVLA [94], UWM [20], DreamZero [ 17], CosmosPolicy [16], FLARE [95], UV A [96] FRAPPE [97], CoV AR [98], LDA1B [99], W A V [100], DUST [101], Ling

co-cited works

representative citing papers

Point Tracking Improves World Action Models

cs.RO · 2026-05-22 · unverdicted · novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.

EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control

cs.RO · 2026-05-21 · conditional · novelty 7.0

EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines on a real robot.

LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.

One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

cs.CV · 2026-05-08 · conditional · novelty 7.0 · 3 refs

Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.

NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

cs.RO · 2026-05-08 · unverdicted · novelty 7.0

NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

Being-H0.7: A Latent World-Action Model from Egocentric Videos

cs.RO · 2026-04-30 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models

cs.RO · 2026-04-28 · unverdicted · novelty 7.0

Privileged Foresight Distillation distills the residual difference in action predictions with versus without future context into a current-only adapter, yielding consistent gains on LIBERO and RoboTwin benchmarks.

${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

cs.LG · 2026-04-16 · unverdicted · novelty 7.0

π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

cs.RO · 2026-04-09 · unverdicted · novelty 7.0

ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

cs.RO · 2026-02-23 · unverdicted · novelty 7.0

UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.

Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

cs.RO · 2026-02-18 · unverdicted · novelty 7.0

PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.

DreamGen: Unlocking Generalization in Robot Learning through Video World Models

cs.RO · 2025-05-19 · unverdicted · novelty 7.0

DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperation dataset.

DIM-WAM: World-Action Modeling with Diverse Historical Event Memory

cs.RO · 2026-06-26 · unverdicted · novelty 6.0

DiM-WAM is a memory-augmented world-action model that integrates multi-scale historical events and global task progress to improve long-horizon robot manipulation performance.

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

VISUALTHINK-VLA uses visual evidence tokens and selective routing to reach top success rates on VLA benchmarks while cutting reasoning latency from multi-second to sub-second levels.

Turning Video Models into Generalist Robot Policies

cs.RO · 2026-05-27 · unverdicted · novelty 6.0

Decouples action-free video world models from embodiment-specific IDMs using Jacobian-based translation to achieve zero-shot cross-embodiment robot policies.

Action with Visual Primitives

cs.RO · 2026-05-21 · unverdicted · novelty 6.0

AVP architecture has VLM emit visual-primitive tokens to condition flow-matching action expert, yielding 27.61% higher success rate than pi_0.5 on real-robot pick-and-place tasks.

UAM: A Dual-Stream Perspective on Forgetting in VLA Training

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

UAM adds a Dorsal Expert initialized from a generative model and trained on visual dynamics prediction to preserve over 95% of VLM multimodal ability in VLA training while achieving top success rates on manipulation tasks including OOD cases.

HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

cs.RO · 2026-05-11 · unverdicted · novelty 6.0

HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.

PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models

cs.RO · 2026-05-11 · unverdicted · novelty 6.0

PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.

ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

cs.RO · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation

cs.CV · 2026-04-29 · unverdicted · novelty 6.0

A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioning, step grounding, and cross-modal retrieval.

GazeVLA: Learning Human Intention for Robotic Manipulation

cs.RO · 2026-04-24 · unverdicted · novelty 6.0

GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

cs.RO · 2026-04-23 · unverdicted · novelty 6.0

CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.

Human Cognition in Machines: A Unified Perspective of World Models

cs.RO · 2026-04-17 · unverdicted · novelty 6.0

The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.

citing papers explorer

Showing 50 of 71 citing papers.

Point Tracking Improves World Action Models cs.RO · 2026-05-22 · unverdicted · none · ref 32 · internal anchor
JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control cs.RO · 2026-05-21 · conditional · none · ref 4 · internal anchor
EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines on a real robot.
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models cs.AI · 2026-05-11 · unverdicted · none · ref 21 · internal anchor
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy cs.CV · 2026-05-08 · conditional · none · ref 11 · 3 links · internal anchor
Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models cs.RO · 2026-05-08 · unverdicted · none · ref 23 · internal anchor
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
Being-H0.7: A Latent World-Action Model from Egocentric Videos cs.RO · 2026-04-30 · unverdicted · none · ref 90 · internal anchor
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models cs.RO · 2026-04-28 · unverdicted · none · ref 3 · internal anchor
Privileged Foresight Distillation distills the residual difference in action predictions with versus without future context into a current-only adapter, yielding consistent gains on LIBERO and RoboTwin benchmarks.
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities cs.LG · 2026-04-16 · unverdicted · none · ref 28 · internal anchor
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning cs.RO · 2026-04-09 · unverdicted · none · ref 7 · internal anchor
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models cs.RO · 2026-02-23 · unverdicted · none · ref 5 · internal anchor
UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.
Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation cs.RO · 2026-02-18 · unverdicted · none · ref 8 · internal anchor
PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.
DreamGen: Unlocking Generalization in Robot Learning through Video World Models cs.RO · 2025-05-19 · unverdicted · none · ref 49 · internal anchor
DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperation dataset.
DIM-WAM: World-Action Modeling with Diverse Historical Event Memory cs.RO · 2026-06-26 · unverdicted · none · ref 11 · internal anchor
DiM-WAM is a memory-augmented world-action model that integrates multi-scale historical events and global task progress to improve long-horizon robot manipulation performance.
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies cs.CV · 2026-05-28 · unverdicted · none · ref 20 · internal anchor
VISUALTHINK-VLA uses visual evidence tokens and selective routing to reach top success rates on VLA benchmarks while cutting reasoning latency from multi-second to sub-second levels.
Turning Video Models into Generalist Robot Policies cs.RO · 2026-05-27 · unverdicted · none · ref 9 · internal anchor
Decouples action-free video world models from embodiment-specific IDMs using Jacobian-based translation to achieve zero-shot cross-embodiment robot policies.
Action with Visual Primitives cs.RO · 2026-05-21 · unverdicted · none · ref 23 · internal anchor
AVP architecture has VLM emit visual-primitive tokens to condition flow-matching action expert, yielding 27.61% higher success rate than pi_0.5 on real-robot pick-and-place tasks.
UAM: A Dual-Stream Perspective on Forgetting in VLA Training cs.CV · 2026-05-15 · unverdicted · none · ref 7 · internal anchor
UAM adds a Dorsal Expert initialized from a generative model and trained on visual dynamics prediction to preserve over 95% of VLM multimodal ability in VLA training while achieving top success rates on manipulation tasks including OOD cases.
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 10 · internal anchor
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 10 · internal anchor
PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 10 · 2 links · internal anchor
ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation cs.CV · 2026-04-29 · unverdicted · none · ref 17 · internal anchor
A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioning, step grounding, and cross-modal retrieval.
GazeVLA: Learning Human Intention for Robotic Manipulation cs.RO · 2026-04-24 · unverdicted · none · ref 13 · internal anchor
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors cs.RO · 2026-04-23 · unverdicted · none · ref 7 · internal anchor
CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
Human Cognition in Machines: A Unified Perspective of World Models cs.RO · 2026-04-17 · unverdicted · none · ref 28 · internal anchor
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.
Device-Conditioned Neural Architecture Search for Efficient Robotic Manipulation cs.RO · 2026-04-11 · unverdicted · none · ref 12 · internal anchor
DC-QFA trains one supernet over architectures and bit-widths, then runs a fast per-device search plus multi-step distillation to deliver 2-3x faster robotic policies across hardware with negligible success-rate drop.
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis cs.RO · 2026-04-10 · unverdicted · none · ref 11 · internal anchor
VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds cs.RO · 2026-04-09 · unverdicted · none · ref 10 · internal anchor
SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformable manipulation.
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model cs.RO · 2026-04-03 · conditional · none · ref 10 · internal anchor
MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
Fast-WAM: Do World Action Models Need Test-time Future Imagination? cs.CV · 2026-03-17 · unverdicted · none · ref 26 · internal anchor
Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation cs.RO · 2026-03-16 · unverdicted · none · ref 4 · internal anchor
SimDist pretrains world models in simulation and adapts them to real-world robots by updating only the latent dynamics model, enabling rapid improvement on contact-rich tasks where prior methods fail.
World Action Models are Zero-shot Policies cs.RO · 2026-02-17 · unverdicted · none · ref 17 · internal anchor
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment transfer with 10-30 minutes of data.
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation cs.RO · 2026-01-11 · unverdicted · none · ref 12 · internal anchor
PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.
AstraNav-World: World Model for Foresight Control and Consistency cs.CV · 2025-12-25 · unverdicted · none · ref 6 · internal anchor
AstraNav-World unifies diffusion video generation and vision-language action planning in a single bidirectional model that improves trajectory accuracy, success rates, and zero-shot real-world adaptation in embodied navigation.
HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models cs.RO · 2025-12-10 · unverdicted · none · ref 8 · internal anchor
HiF-VLA improves long-horizon robotic manipulation by encoding past motion as hindsight priors and anticipating future motion through foresight reasoning inside a VLA framework.
R2RGEN: Real-to-Real 3D Data Generation for Spatially Generalized Manipulation cs.RO · 2025-10-09 · unverdicted · none · ref 3 · internal anchor
R2RGen introduces a simulator-free three-stage pipeline that parses, augments, and post-processes real pointcloud observation-action pairs to improve spatial generalization in robotic manipulation policies.
Video Generators are Robot Policies cs.RO · 2025-08-01 · conditional · none · ref 47 · internal anchor
Training models to generate videos of robot actions produces policies that generalize better to new objects and tasks while using far less demonstration data than standard behavior cloning.
AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation cs.CV · 2025-07-17 · unverdicted · none · ref 7 · internal anchor
AnyPos automates task-agnostic action collection and inverse-dynamics modeling with arm/end-effector decoupling plus a direction-aware decoder, delivering 51% higher test accuracy and 30-40% better success rates on bimanual tasks.
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge cs.CV · 2025-07-06 · unverdicted · none · ref 18 · internal anchor
DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 average length on CALVIN ABC-D.
ReSim: Reliable World Simulation for Autonomous Driving cs.CV · 2025-06-11 · unverdicted · none · ref 54 · internal anchor
ReSim is a controllable video world model trained on heterogeneous real and simulated driving data that achieves higher fidelity and controllability for both expert and non-expert actions, plus a Video2Reward module for estimating action quality from simulated futures.
Real-Time Execution of Action Chunking Flow Policies cs.RO · 2025-06-09 · unverdicted · none · ref 9 · internal anchor
Real-time chunking (RTC) allows diffusion- and flow-based action chunking policies to execute smoothly and asynchronously, maintaining high success rates on dynamic tasks even with significant inference latency.
FLARE: Robot Learning with Implicit World Modeling cs.RO · 2025-05-21 · unverdicted · none · ref 37 · internal anchor
FLARE integrates predictive latent world modeling into diffusion transformer policies for robots, delivering up to 26% gains on multitask manipulation benchmarks and enabling co-training with action-free human videos.
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data cs.RO · 2025-05-06 · unverdicted · none · ref 30 · internal anchor
GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models cs.CV · 2025-03-27 · unverdicted · none · ref 5 · internal anchor
CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots cs.RO · 2025-03-18 · unverdicted · none · ref 19 · internal anchor
GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.
FAST: Efficient Action Tokenization for Vision-Language-Action Models cs.RO · 2025-01-16 · unverdicted · none · ref 11 · internal anchor
FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diffusion VLA performance with up to 5x faster training.
RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation cs.RO · 2024-12-18 · accept · none · ref 14 · internal anchor
RoboMIND is a large-scale multi-embodiment teleoperation dataset for robot manipulation containing 107k trajectories across four robots, with failure annotations and a digital twin simulator.
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation cs.RO · 2024-11-29 · unverdicted · none · ref 11 · internal anchor
CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new robots and objects.
PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation cs.CV · 2026-06-26 · unverdicted · none · ref 6 · internal anchor
PhysisForcing applies trajectory and relational alignment losses to DiT features in video models, improving physical plausibility on R-Bench, PAI-Bench, and EZS-Bench while raising closed-loop robotic success rates from 16% to 24%.
World Models for Robotic Manipulation: A Survey cs.RO · 2026-05-27 · accept · none · ref 32 · internal anchor
Survey organizing world models for robotic manipulation into representation families, a functional taxonomy, and infrastructure roles across pretraining, post-training, and inference, while reviewing 34 datasets and evaluation protocols.
SANTS: A State-Adaptive Scheduler for World Action Models cs.RO · 2026-05-27 · unverdicted · none · ref 19 · internal anchor
SANTS adaptively chooses denoising depth in video-based robot action diffusion policies using a state-dependent stopping hazard and noise ratio, trained via downstream action reward to reduce latency.

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer