pith. sign in

hub Canonical reference

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

Canonical reference. 80% of citing Pith papers cite this work as background.

93 Pith papers citing it
Background 80% of classified citations
abstract

We present GR-2, a state-of-the-art generalist robot agent for versatile and generalizable robot manipulation. GR-2 is first pre-trained on a vast number of Internet videos to capture the dynamics of the world. This large-scale pre-training, involving 38 million video clips and over 50 billion tokens, equips GR-2 with the ability to generalize across a wide range of robotic tasks and environments during subsequent policy learning. Following this, GR-2 is fine-tuned for both video generation and action prediction using robot trajectories. It exhibits impressive multi-task learning capabilities, achieving an average success rate of 97.7% across more than 100 tasks. Moreover, GR-2 demonstrates exceptional generalization to new, previously unseen scenarios, including novel backgrounds, environments, objects, and tasks. Notably, GR-2 scales effectively with model size, underscoring its potential for continued growth and application. Project page: \url{https://gr2-manipulation.github.io}.

hub tools

citation-role summary

background 30 method 4 baseline 1

citation-polarity summary

claims ledger

  • abstract We present GR-2, a state-of-the-art generalist robot agent for versatile and generalizable robot manipulation. GR-2 is first pre-trained on a vast number of Internet videos to capture the dynamics of the world. This large-scale pre-training, involving 38 million video clips and over 50 billion tokens, equips GR-2 with the ability to generalize across a wide range of robotic tasks and environments during subsequent policy learning. Following this, GR-2 is fine-tuned for both video generation and action prediction using robot trajectories. It exhibits impressive multi-task learning capabilities,
  • background Vidar [77], Veo-Act [78], pi0.7 [ 79], V AG [80] Implicit VPP [11], VILP [ 81], Video Policy [13], ARDuP [ 82], mimic-video [ 12], LAP A [15], villa-X [ 83], S-V AM [14], OmniVTA [84], MWM [85] Joint W AM Autoregression GR1 [86], grmg [ 87], GR2 [88], Co TVLA [89], WorldVLA [90], rynnvla2 [91] VLA-JEP A [92], F1-VLA [93] Diffusion-based P AD [21], VideoVLA [94], UWM [20], DreamZero [ 17], CosmosPolicy [16], FLARE [95], UV A [96] FRAPPE [97], CoV AR [98], LDA1B [99], W A V [100], DUST [101], Ling

co-cited works

clear filters

representative citing papers

VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation

cs.RO · 2026-06-05 · unverdicted · novelty 7.0

VoLoAgent uses a VLM to steer heterogeneous robot capabilities as interruptible tools for long-horizon manipulation and introduces the RoboVoLo benchmark, claiming substantial outperformance over single VLA/VLM or tool-based systems with real-robot validation.

Point Tracking Improves World Action Models

cs.RO · 2026-05-22 · unverdicted · novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.

T-Rex: Tactile-Reactive Dexterous Manipulation

cs.RO · 2026-06-15 · unverdicted · novelty 6.0

T-Rex introduces a large tactile dataset and MoT architecture that achieves over 30% higher success rates than baselines on 12 tasks requiring force control and deformable object handling.

Next Forcing: Causal World Modeling with Multi-Chunk Prediction

cs.CV · 2026-06-09 · unverdicted · novelty 6.0

Next Forcing augments video generation models with auxiliary multi-chunk prediction modules to achieve faster training convergence, higher accuracy at high frame rates, and 2x faster inference on world modeling benchmarks.

OSCAR: Omni-Embodiment Action-Conditioned World Model for Robotics

cs.RO · 2026-06-03 · unverdicted · novelty 6.0

OSCAR finetunes Cosmos-Predict2.5-2B on a deduplicated multi-embodiment robotics dataset with kinematic skeleton conditioning, claiming better action following and significant correlation between virtual and real robot policy evaluations.

citing papers explorer

Showing 2 of 2 citing papers after filters.

  • LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models cs.AI · 2026-05-11 · unverdicted · none · ref 21 · internal anchor

    LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.

  • Kairos: A Native World Model Stack for Physical AI cs.AI · 2026-06-15 · unverdicted · none · ref 149 · internal anchor

    Kairos is a native world model stack using cross-embodiment pretraining, hybrid linear temporal attention with theoretical error bounds, and deployment-aware co-design, reporting top performance on embodied benchmarks.