Recognition: 2 theorem links · Lean Theorem
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
Pith reviewed 2026-05-12 06:06 UTC · model grok-4.3
The pith
SpatialVLA uses 3D position encoding and adaptive action grids to build generalist robot manipulation policies with strong generalization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By introducing Ego3D Position Encoding to inject 3D information into the input observations and proposing Adaptive Action Grids to represent spatial robot movement actions with adaptive discretized action grids, SpatialVLA facilitates learning generalizable and transferrable spatial action knowledge for cross-robot control. Pre-trained on top of a vision-language model with 1.1 million real-world robot episodes, it learns a generalist manipulation policy that is directly applied in a zero-shot manner, with superior results showing advantages in inferring complex robot motion trajectories and strong in-domain multi-task generalization ability. The Adaptive Action Grids further offer a new and effective way to fine-tune the pre-trained model for new simulation and real-world setups.
What carries the argument
The Ego3D Position Encoding and Adaptive Action Grids, which together provide spatial awareness to the visual-language-action model by adding 3D positional data to inputs and using adaptive grids for action representation to support cross-robot transfer.
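As a rough illustration of what injecting 3D information into observations can look like, the following minimal sketch back-projects a pixel with a depth estimate into the camera frame and adds a sinusoidal embedding of that 3D point to a patch feature. The pinhole intrinsics, the sinusoidal form, the direct addition, and all function names are assumptions made for illustration; this summary does not specify the paper's actual formulation.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with metric depth into the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def sinusoidal_embed(point, num_freqs=4):
    """Encode each coordinate with sin/cos pairs at frequencies 1, 2, 4, ..."""
    feats = []
    for coord in point:
        for k in range(num_freqs):
            freq = 2.0 ** k
            feats.append(math.sin(freq * coord))
            feats.append(math.cos(freq * coord))
    return feats  # length is 3 * 2 * num_freqs


def add_position_encoding(patch_feature, point, num_freqs=4):
    """Add the 3D embedding element-wise to a patch feature of the same width."""
    pe = sinusoidal_embed(point, num_freqs)
    if len(pe) != len(patch_feature):
        raise ValueError("feature width must equal 6 * num_freqs")
    return [f + p for f, p in zip(patch_feature, pe)]


# Hypothetical intrinsics and depth value, chosen only for the example.
point = backproject(u=320, v=240, depth=1.5, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
encoded = add_position_encoding([0.0] * 24, point)
```

In a real model the embedding would more plausibly pass through a learned projection before being combined with visual tokens; the direct addition here only keeps the sketch short.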
If this is right
- Direct zero-shot application to numerous tasks after pre-training on 1.1M episodes.
- Advantage in inferring complex robot motion trajectories in simulation and real-world.
- Strong in-domain multi-task generalization across multiple robot environments.
- Effective fine-tuning for new simulation and real-world setups via re-discretized action grids.
- Exceptional in-distribution generalization and out-of-distribution adaptation capability.
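To make the idea of re-discretized action grids concrete, here is a minimal sketch that fits bin edges to the empirical distribution of one action dimension and then re-fits them on data from a new setup. Quantile-based (equal-mass) binning is an assumption chosen for illustration, as are the function names and the toy data; this summary does not say how the paper adapts its grids.

```python
import bisect
import statistics


def fit_grid(values, num_bins):
    """Return interior bin edges so that each bin holds roughly equal data mass."""
    return statistics.quantiles(values, n=num_bins, method="inclusive")


def tokenize(value, edges):
    """Map a continuous action value to the index of the bin containing it."""
    return bisect.bisect_right(edges, value)


def detokenize(token, edges, lo, hi):
    """Map a bin index back to the midpoint of that bin."""
    bounds = [lo] + list(edges) + [hi]
    return 0.5 * (bounds[token] + bounds[token + 1])


# Toy per-step x-displacements (metres), invented for the example.
pretrain_dx = [-0.04, -0.02, -0.01, -0.005, 0.0, 0.005, 0.01, 0.02, 0.04]
new_robot_dx = [-0.4, -0.2, -0.1, -0.05, 0.0, 0.05, 0.1, 0.2, 0.4]

pretrain_edges = fit_grid(pretrain_dx, num_bins=4)
new_edges = fit_grid(new_robot_dx, num_bins=4)  # re-discretize for the new setup

token_old = tokenize(0.05, pretrain_edges)  # saturates in the outermost bin
token_new = tokenize(0.05, new_edges)       # lands in an interior bin
```

With the pre-training edges, a 5 cm step saturates in the outermost bin; after re-fitting on the new robot's larger motions, the same step falls in an interior bin, so the token vocabulary size stays fixed while its resolution follows the new data.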
Where Pith is reading between the lines
- Similar spatial injection techniques could be applied to other foundation models in robotics to improve their spatial reasoning without full retraining.
- The adaptive discretization might allow for easier integration of new robot hardware by preserving learned spatial priors.
- Extending this to longer-horizon tasks or environments with dynamic obstacles could test the limits of the spatial representations.
- Combining the model with online adaptation mechanisms might further enhance real-world deployment reliability.
Load-bearing premise
That the reported performance gains stem mainly from the Ego3D Position Encoding and Adaptive Action Grids rather than from the choice of vision-language model base or the volume of pre-training data alone.
What would settle it
Training an identical model without the Ego3D encoding or with non-adaptive fixed action grids on the same 1.1M episodes and evaluating whether the generalization metrics in simulation and real-world tasks match or fall short of the SpatialVLA results.
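The controlled comparison described above amounts to a small factorial design. The sketch below enumerates it, holding data and backbone fixed while toggling the two components; the variant labels and dictionary keys are hypothetical placeholders, not configuration names from the paper.

```python
from itertools import product

# Hypothetical labels for the two factors under test.
POSITION_ENCODINGS = ("ego3d", "standard_2d")
ACTION_GRIDS = ("adaptive", "fixed_uniform")


def ablation_runs():
    """Enumerate every encoding/grid combination with all other settings held fixed."""
    runs = []
    for encoding, grid in product(POSITION_ENCODINGS, ACTION_GRIDS):
        runs.append({
            "position_encoding": encoding,
            "action_grid": grid,
            "pretraining_data": "same 1.1M episodes",  # held constant
            "vlm_backbone": "same backbone",           # held constant
        })
    return runs


runs = ablation_runs()
```

The first entry corresponds to the full model and the last to a variant with neither component, so differences between runs can be attributed to the toggled factor rather than to scale.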
Original abstract
In this paper, we claim that spatial understanding is the keypoint in robot manipulation, and propose SpatialVLA to explore effective spatial representations for the robot foundation model. Specifically, we introduce Ego3D Position Encoding to inject 3D information into the input observations of the visual-language-action model, and propose Adaptive Action Grids to represent spatial robot movement actions with adaptive discretized action grids, facilitating learning generalizable and transferrable spatial action knowledge for cross-robot control. SpatialVLA is first pre-trained on top of a vision-language model with 1.1 Million real-world robot episodes, to learn a generalist manipulation policy across multiple robot environments and tasks. After pre-training, SpatialVLA is directly applied to perform numerous tasks in a zero-shot manner. The superior results in both simulation and real-world robots demonstrate its advantage of inferring complex robot motion trajectories and its strong in-domain multi-task generalization ability. We further show the proposed Adaptive Action Grids offer a new and effective way to fine-tune the pre-trained SpatialVLA model for new simulation and real-world setups, where the pre-learned action grids are re-discretized to capture robot-specific spatial action movements of new setups. The superior results from extensive evaluations demonstrate the exceptional in-distribution generalization and out-of-distribution adaptation capability, highlighting the crucial benefit of the proposed spatial-aware representations for generalist robot policy learning. All the details and codes will be open-sourced.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SpatialVLA, a visual-language-action model for robot manipulation that augments a VLM backbone with Ego3D Position Encoding (to inject 3D spatial information into visual observations) and Adaptive Action Grids (to discretize actions adaptively for cross-robot transfer). The model is pre-trained on 1.1 million real-world robot episodes, then evaluated zero-shot on simulation and real-world tasks and further fine-tuned via re-discretization of the action grids for new setups. The central claim is that these spatial representations enable superior trajectory inference, strong in-domain multi-task generalization, and effective out-of-distribution adaptation compared to prior VLA approaches.
Significance. If the performance claims are supported by rigorous quantitative evidence and isolating ablations, the work would meaningfully advance generalist robot policies by demonstrating that explicit spatial encodings and adaptive action discretization can improve generalization across robots and tasks beyond scale alone. The large-scale pre-training regime and commitment to open-sourcing code and models are positive contributions that could facilitate follow-on research.
major comments (2)
- [§4 (Experiments)] §4 (Experiments) and associated tables/figures: the central attribution of superior zero-shot and fine-tuning results to Ego3D Position Encoding and Adaptive Action Grids is not yet load-bearing because the manuscript reports only end-to-end comparisons against external baselines. No controlled ablations are described that hold the 1.1M-episode pre-training data, VLM backbone, and training procedure fixed while swapping in standard positional encodings or fixed (non-adaptive) action grids. Without these, it remains possible that gains derive primarily from pre-training scale rather than the proposed spatial components.
- [Abstract and §4] Abstract and §4: the repeated claim of 'superior results' and 'strong in-domain multi-task generalization' is presented without early quantitative anchors (specific success rates, baselines, error bars, or statistical significance). This makes the strength of the empirical support difficult to assess from the high-level summary and requires the reader to locate the precise metrics and comparisons later in the text.
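One way to supply the error bars requested above is to report a confidence interval next to each success rate. The following sketch computes a Wilson score interval for a binomial success rate; the trial counts in the example are made up for illustration and are not results from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (center - half, center + half)


# Hypothetical evaluation: 17 successful rollouts out of 20.
low, high = wilson_interval(17, 20)
```

With only a few dozen rollouts per task the interval is wide, which is why a bare success rate without trial counts is hard to interpret.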
minor comments (2)
- [§3] Notation for Ego3D Position Encoding and the discretization parameters of Adaptive Action Grids should be introduced with explicit equations or pseudocode in §3 to allow precise reproduction.
- [Conclusion] The manuscript states that 'all details and codes will be open-sourced' but does not specify the exact release timeline or repository; adding this information would strengthen reproducibility claims.
Simulated Author's Rebuttal
We thank the referee for their constructive comments and the opportunity to clarify and strengthen our manuscript. We address each major comment below.
Point-by-point responses
-
Referee: [§4 (Experiments)] §4 (Experiments) and associated tables/figures: the central attribution of superior zero-shot and fine-tuning results to Ego3D Position Encoding and Adaptive Action Grids is not yet load-bearing because the manuscript reports only end-to-end comparisons against external baselines. No controlled ablations are described that hold the 1.1M-episode pre-training data, VLM backbone, and training procedure fixed while swapping in standard positional encodings or fixed (non-adaptive) action grids. Without these, it remains possible that gains derive primarily from pre-training scale rather than the proposed spatial components.
Authors: We agree that controlled ablations would provide stronger evidence for the specific contributions of our proposed components. In the revised manuscript, we will include additional experiments that fix the 1.1M-episode pre-training data, VLM backbone, and training procedure, and compare variants with standard positional encodings versus Ego3D Position Encoding, as well as fixed action grids versus Adaptive Action Grids. These ablations will help isolate the impact of the spatial representations.
Revision: yes
-
Referee: [Abstract and §4] Abstract and §4: the repeated claim of 'superior results' and 'strong in-domain multi-task generalization' is presented without early quantitative anchors (specific success rates, baselines, error bars, or statistical significance). This makes the strength of the empirical support difficult to assess from the high-level summary and requires the reader to locate the precise metrics and comparisons later in the text.
Authors: We acknowledge that early quantitative anchors would enhance the clarity of our claims. We will revise the abstract and the opening of §4 to include specific success rates from our evaluations, comparisons to key baselines, and references to error bars and statistical details provided in the tables and figures. This will allow readers to immediately gauge the empirical support without needing to search further in the paper.
Revision: yes
Circularity Check
No circularity: empirical pre-training and evaluation with no self-referential derivations
Full rationale
The paper proposes two spatial components (Ego3D Position Encoding and Adaptive Action Grids), pre-trains a VLA model on 1.1M real-world episodes, then reports zero-shot and fine-tuning results on simulation and real robots. No equations, predictions, or first-principles derivations are presented that reduce to fitted parameters or prior self-citations by construction. All claims are framed as measured experimental outcomes rather than analytic necessities. No load-bearing self-citations, uniqueness theorems, or ansatzes appear in the abstract or described method. This is a standard empirical robotics paper whose central claims rest on external benchmarks and data, not internal tautologies.
Axiom & Free-Parameter Ledger
free parameters (1)
- Action grid discretization resolution
axioms (2)
- domain assumption: Vision-language models can be extended with additional position encodings to incorporate 3D spatial information effectively
- domain assumption: Discretized action grids can capture transferable spatial movement knowledge across robots
invented entities (2)
- Ego3D Position Encoding: no independent evidence
- Adaptive Action Grids: no independent evidence
Lean theorems connected to this paper
- Foundation.DimensionForcingD3_admits_circle_linking (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "we introduce Ego3D Position Encoding to inject 3D information into the input observations of the visual-language-action model, and propose Adaptive Action Grids to represent spatial robot movement actions with adaptive discretized action grids"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 43 Pith papers
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models
VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
-
Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models
GuardVLA embeds a stealthy backdoor watermark in VLAs via secret messages in visual data and uses a swap-and-detect mechanism for post-release ownership verification that preserves task performance.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields
EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.
-
MolmoAct2: Action Reasoning Models for Real-world Deployment
MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.
-
Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation
A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.
-
Being-H0.7: A Latent World-Action Model from Egocentric Videos
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
-
VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis
VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.
-
Why MLLMs Struggle to Determine Object Orientations
Orientation information is recoverable from MLLM visual encoder embeddings via linear regression, contradicting the hypothesis that failures originate in the encoders.
-
DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching
DFM-VLA uses discrete flow matching to iteratively refine action tokens in VLA models, outperforming autoregressive and diffusion baselines with 4.44 average success length on CALVIN and 95.7% success on LIBERO.
-
VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models
VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.
-
FrameSkip: Learning from Fewer but More Informative Frames in VLA Training
FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.
-
See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model
GridS reduces visual tokens in VLA models to under 10% of the original count via task-aware differentiable resampling, delivering 76% lower FLOPs with no drop in task success rate on benchmarks and real robots.
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
VISOR: A Vision-Language Model-based Test Oracle for Testing Robot
VISOR applies VLMs to automate robot test oracles for correctness and quality assessment while reporting uncertainty, with evaluation on GPT and Gemini showing trade-offs in precision and recall but poor uncertainty c...
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
-
ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.
-
TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation
TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.
-
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
-
MolmoAct2: Action Reasoning Models for Real-world Deployment
MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...
-
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
-
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.
-
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations
PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
-
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
-
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...
-
Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation
MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.
-
Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models
PDF improves VLA success rates on LIBERO and Atari by applying test-time perturbation learning with delayed feedback to correct trajectory overfitting and overconfidence.
-
Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models
State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks...
-
ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation
ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.
-
Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
-
Adaptive Action Chunking at Inference-time for Vision-Language-Action Models
Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.
-
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
-
X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction
X-Imitator is a bidirectional action-pose interaction framework for spatial-aware imitation learning that outperforms vanilla policies and explicit pose guidance on 24 simulated and 3 real-world robotic tasks.
-
Gated Memory Policy
GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.
-
R3D: Revisiting 3D Policy Learning
A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.
-
World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems
The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.
-
From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.
-
Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance
Parameter differences from two training runs on a small task set are treated as auxiliary capability vectors that are merged into a pretrained VLA model, yielding auxiliary-task gains at the cost of ordinary supervise...
-
MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.
-
JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy
JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.
Reference graph
Works this paper leans on
-
[1]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Proceedings of the Conference on Neural Information Processing System (NeurIPS) , 2022
work page 2022
-
[2]
Hydra: Hybrid robot actions for imitation learning
Suneel Belkhale, Yuchen Cui, and Dorsa Sadigh. Hydra: Hybrid robot actions for imitation learning. In Proceed- ings of the Conference on Robot Learning (CoRL) , 2023
work page 2023
-
[3]
Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Ab- hinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manip- ulation via semantic augmentations and action chunking. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) , 2024
work page 2024
-
[4]
ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth
Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Pe- ter Wonka, and Matthias M ¨uller. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023
work page internal anchor Pith review arXiv 2023
-
[5]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision- language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 , 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[7]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video- language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158 , 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[9]
Berkeley UR5 demonstration dataset
Lawrence Yunliang Chen, Simeon Adebola, and Ken Goldberg. Berkeley UR5 demonstration dataset. https: //sites.google.com/view/berkeley-ur5/home
-
[10]
Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning
Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
work page 2024
-
[11]
Pali-x: On scaling up a multilingual vision and language model
Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2024
work page 2024
-
[12]
Diffusion policy: Visuomotor policy learning via action diffusion
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023
work page 2023
-
[13]
Open x- embodiment: Robotic learning datasets and rt-x models
Open X-Embodiment Collaboration, Abby O’Neill, Ab- dul Rehman, Abhiram Maddukuri, Abhishek Gupta, Ab- hishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x- embodiment: Robotic learning datasets and rt-x models. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) , 2024
work page 2024
-
[14]
From play to policy: Conditional behavior generation from uncurated robot data
Zichen Jeff Cui, Yibin Wang, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. From play to policy: Conditional behavior generation from uncurated robot data. In Proceedings of International Conference on Learning Representations (ICLR) , 2023
work page 2023
-
[15]
Shivin Dass, Jullian Yapeter, Jesse Zhang, Jiahui Zhang, Karl Pertsch, Stefanos Nikolaidis, and Joseph J. Lim. Clvr jaco play dataset, 2023. URL https://github.com/ clvrai/clvr jaco play dataset
work page 2023
-
[16]
Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation
Ria Doshi, Homer Walke, Oier Mees, Sudeep Dasari, and Sergey Levine. Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation. In Proceedings of the Conference on Robot Learning (CoRL), 2024
work page 2024
-
[17]
Bridge data: Boosting generalization of robotic skills with cross- domain datasets
Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Dani- ilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross- domain datasets. In Proceedings of Robotics: Science and Systems (RSS) , 2022
work page 2022
-
[18]
Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot
Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. In RSS 2023 Workshop on Learning for Task and Motion Planning , 2023
work page 2023
-
[19]
Scene-llm: Extending language model for 3d visual understanding and reasoning,
Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wenhan Xiong. Scene-llm: Extending language model for 3d visual understanding and reasoning. arXiv preprint arXiv:2403.11401, 2024
-
[20]
Charles R Gallistel. The organization of learning. The MIT Press, 1990
work page 1990
-
[21]
Polytask: Learning unified policies through behavior distillation.arXiv preprint arXiv:2310.08573,
Siddhant Haldar and Lerrel Pinto. Polytask: Learning unified policies through behavior distillation. arXiv preprint arXiv:2310.08573, 2023
-
[22]
Baku: An efficient transformer for multi-task policy learning
Siddhant Haldar, Zhuoran Peng, and Lerrel Pinto. Baku: An efficient transformer for multi-task policy learning. In Proceedings of the Conference on Neural Information Processing System (NeurIPS) , 2024
work page 2024
-
[23]
Furniturebench: Reproducible real-world bench- mark for long-horizon complex manipulation
Minho Heo, Youngwoon Lee, Doohyun Lee, and Joseph J Lim. Furniturebench: Reproducible real-world bench- mark for long-horizon complex manipulation. In Pro- ceedings of Robotics: Science and Systems (RSS) , 2023
work page 2023
-
[24]
3d- llm: Injecting the 3d world into large language models
Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d- llm: Injecting the 3d world into large language models. In Proceedings of the Conference on Neural Information Processing System (NeurIPS) , 2023
work page 2023
-
[25]
An embodied generalist agent in 3d world
Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. In Proceedings of the International Conference on Machine Learning (ICML), 2024
work page 2024
-
[26]
Bc-z: Zero-shot task generalization with robotic imitation learning
Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Proceedings of the Conference on Robot Learning (CoRL), 2022
work page 2022
-
[27]
Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation
Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In Proceedings of the Conference on Robot Learning (CoRL), 2018
work page 2018
-
[28]
Prismatic vlms: Investigating the design space of visually-conditioned language models
Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. In Proceedings of the International Conference on Machine Learning (ICML), 2024
work page 2024
-
[29]
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[30]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[31]
Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024
work page Pith review arXiv 2024
-
[33]
Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, and Huaping Liu. Towards generalist robot policies: What matters in building vision-language-action models. arXiv preprint arXiv:2412.14058, 2024
-
[34]
Vision-language foundation models as effective robot imitators
Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. In Proceedings of International Conference on Learning Representations (ICLR), 2024
work page 2024
-
[35]
Evaluating real-world robot manipulation policies in simulation
Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation. In Proceedings of the Conference on Robot Learning (CoRL), 2024
work page 2024
-
[36]
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. arXiv preprint arXiv:2306.03310, 2023
work page internal anchor Pith review arXiv 2023
-
[37]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Proceedings of the Conference on Neural Information Processing System (NeurIPS), 2024
work page 2024
-
[38]
Robot learning on the job: Human-in-the-loop autonomy and learning during deployment
Huihan Liu, Soroush Nasiriany, Lance Zhang, Zhiyao Bao, and Yuke Zhu. Robot learning on the job: Human-in-the-loop autonomy and learning during deployment. In Proceedings of Robotics: Science and Systems (RSS), 2023
work page 2023
-
[39]
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[40]
Robert H Logie. Visuo-spatial working memory. Psychology Press, 2014
work page 2014
-
[41]
Multi-stage cable routing through hierarchical imitation learning
Jianlan Luo, Charles Xu, Xinyang Geng, Gilbert Feng, Kuan Fang, Liam Tan, Stefan Schaal, and Sergey Levine. Multi-stage cable routing through hierarchical imitation learning. IEEE Transactions on Robotics, 40:1476–1491, 2024
work page 2024
-
[42]
Fmb: a functional manipulation benchmark for generalizable robotic learning
Jianlan Luo, Charles Xu, Fangchen Liu, Liam Tan, Zipeng Lin, Jeffrey Wu, Pieter Abbeel, and Sergey Levine. Fmb: a functional manipulation benchmark for generalizable robotic learning. The International Journal of Robotics Research, 2024
work page 2024
-
[43]
Interactive language: Talking to robots in real time
Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023
work page 2023
-
[44]
Roboturk: A crowdsourcing platform for robotic skill learning through imitation
Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Proceedings of the Conference on Robot Learning (CoRL), 2018
work page 2018
-
[45]
Grounding language with visual affordances over unstructured data
Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. Grounding language with visual affordances over unstructured data. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2023
work page 2023
-
[46]
Structured world models from human videos
Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos. In Proceedings of the Conference on Robot Learning (CoRL), 2023
work page 2023
-
[47]
Learning and retrieval from prior data for skill-based imitation learning
Soroush Nasiriany, Tian Gao, Ajay Mandlekar, and Yuke Zhu. Learning and retrieval from prior data for skill-based imitation learning. In Proceedings of the Conference on Robot Learning (CoRL), 2023
work page 2023
-
[48]
Octo: An open-source generalist robot policy
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science an...
work page 2024
-
[49]
Actor-mimic: Deep multitask and transfer reinforcement learning
Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. In Proceedings of International Conference on Learning Representations (ICLR), 2016
work page 2016
-
[50]
Kosmos-2: Grounding Multimodal Large Language Models to the World
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[51]
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[52]
Child’s Conception of Space: Selected Works vol 4
Jean Piaget. Child’s Conception of Space: Selected Works vol 4. Routledge, 2013
work page 2013
-
[53]
Livescene: Language embedding interactive radiance fields for physical scene rendering and control
Delin Qu, Qizhi Chen, Pingrui Zhang, Xianqiang Gao, Junzhe Li, Bin Zhao, Dong Wang, and Xuelong Li. Livescene: Language embedding interactive radiance fields for physical scene rendering and control. arXiv preprint arXiv:2406.16038, 2024
-
[54]
Shared control templates for assistive robotics
Gabriel Quere, Annette Hagengruber, Maged Iskandar, Samuel Bustamante, Daniel Leidner, Freek Stulp, and Jörn Vogel. Shared control templates for assistive robotics. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2020
work page 2020
-
[55]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), 2021
work page 2021
-
[56]
Latent plans for task-agnostic offline reinforcement learning
Erick Rosete-Beas, Oier Mees, Gabriel Kalweit, Joschka Boedecker, and Wolfram Burgard. Latent plans for task-agnostic offline reinforcement learning. In Proceedings of the Conference on Robot Learning (CoRL), 2022
work page 2022
-
[57]
Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. In Proceedings of International Conference on Learning Representations (ICLR), 2016
work page 2016
-
[58]
Multi-resolution sensing for real-time control with vision-language models
Saumya Saxena, Mohit Sharma, and Oliver Kroemer. Multi-resolution sensing for real-time control with vision-language models. In Proceedings of the Conference on Robot Learning (CoRL), 2023
work page 2023
-
[59]
Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home. arXiv preprint arXiv:2311.16098, 2023
-
[60]
Mutex: Learning unified policies from multimodal task specifications
Rutav Shah, Roberto Martín-Martín, and Yuke Zhu. Mutex: Learning unified policies from multimodal task specifications. In Proceedings of the Conference on Robot Learning (CoRL), 2023
work page 2023
-
[61]
Perceiver-actor: A multi-task transformer for robotic manipulation
Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Proceedings of the Conference on Robot Learning (CoRL), 2022
work page 2022
-
[62]
PaliGemma 2: A Family of Versatile VLMs for Transfer
Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, et al. Paligemma 2: A family of versatile vlms for transfer. arXiv preprint arXiv:2412.03555, 2024
work page internal anchor Pith review arXiv 2024
-
[63]
Cognitive maps in rats and men
Edward C Tolman. Cognitive maps in rats and men. Psychological review, 55(4):189, 1948
work page 1948
-
[64]
Bridgedata v2: A dataset for robot learning at scale
Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. In Proceedings of the Conference on Robot Learning (CoRL), 2023
work page 2023
-
[65]
Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers
Lirui Wang, Xinlei Chen, Jialiang Zhao, and Kaiming He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. In Proceedings of the Conference on Neural Information Processing System (NeurIPS), 2024
work page 2024
-
[66]
Ge Yan, Kris Wu, and Xiaolong Wang. ucsd kitchens dataset. https://github.com/geyan21/rlds_dataset_builder/tree/main/ucsd_kitchens, 2023
work page 2023
-
[67]
Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. arXiv preprint arXiv:2412.14171, 2024
-
[68]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023
work page 2023
-
[69]
3d-vla: A 3d vision-language-action generative world model
Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: A 3d vision-language-action generative world model. In Proceedings of the International Conference on Machine Learning (ICML), 2024
work page 2024
-
[70]
Universal actions for enhanced embodied foundation models
Jinliang Zheng, Jianxiong Li, Dongxiu Liu, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, and Xianyuan Zhan. Universal actions for enhanced embodied foundation models. arXiv preprint arXiv:2501.10105, 2025
-
[71]
Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies
Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024
-
[72]
Train offline, test online: A real robot learning benchmark
Gaoyue Zhou, Victoria Dean, Mohan Kumar Srirama, Aravind Rajeswaran, Jyothish Pari, Kyle Hatch, Aryan Jain, Tianhe Yu, Pieter Abbeel, Lerrel Pinto, et al. Train offline, test online: A real robot learning benchmark. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2023
work page 2023
-
[73]
Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness
Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness. arXiv preprint arXiv:2409.18125, 2024
-
[74]
Fanuc manipulation: A dataset for learning-based manipulation with fanuc mate 200id robot
Xinghao Zhu, Ran Tian, Chenfeng Xu, Mingxiao Huo, Wei Zhan, Masayoshi Tomizuka, and Mingyu Ding. Fanuc manipulation: A dataset for learning-based manipulation with fanuc mate 200id robot. https://sites.google.com/berkeley.edu/fanuc-manipulation, 2023
work page 2023
-
[75]
Bottom-up skill discovery from unsegmented demonstrations for long-horizon robot manipulation
Yifeng Zhu, Peter Stone, and Yuke Zhu. Bottom-up skill discovery from unsegmented demonstrations for long-horizon robot manipulation. IEEE Robotics and Automation Letters, 7(2):4126–4133, 2022
work page 2022
-
[76]
Learning generalizable manipulation policies with object-centric 3d representations
Yifeng Zhu, Zhenyu Jiang, Peter Stone, and Yuke Zhu. Learning generalizable manipulation policies with object-centric 3d representations. In Proceedings of the Conference on Robot Learning (CoRL), 2023
work page 2023
-
[77]
Viola: Imitation learning for vision-based manipulation with object proposal priors
Yifeng Zhu, Abhishek Joshi, Peter Stone, and Yuke Zhu. Viola: Imitation learning for vision-based manipulation with object proposal priors. In Proceedings of the Conference on Robot Learning (CoRL), 2023.
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
Supplementary Material
APPENDIX
A. Dataset Mixture Details
Fig. 9 illust...
work page 2023
-
[78]
Zero-shot Robot Control Evaluation on WidowX Robot. As described in Sec. IV-A, we conducted extensive evaluations of 5 generalist robot manipulation policies across 7 zero-shot tasks, with 11 trials per task on a real-world BridgeV2 WidowX Robot. The specific task settings are: ...
-
[79]
Adapting to New Robot Setups on Franka Robot. As described in Sec. IV-B, we evaluated the performance of four methods (Diffusion Policy [12], Octo [48], OpenVLA [30], and SpatialVLA) across 13 real-world tasks on a Franka Emika Panda robot, with 11 trials per task. While Diffusion Policy was trained from scratch, Octo, OpenVLA and SpatialVLA were fine-tu...
-
[80]
Spatial Understanding Capability Evaluation on Franka and WidowX Robot. Following Sec. IV-C, we conducted a comprehensive evaluation of spatial understanding capabilities through 3 zero-shot tasks on the BridgeData V2 WidowX Robot and 1 efficient-finetuning task on the Franka Robot. The detailed task specifications are: • Place plush toy closest to robot ...
-
[81]
SimplerEnv Evaluation. Tab. X presents the SimplerEnv evaluation results on the Google Robot tasks, which include Coke can manipulation (horizontal and vertical picking) and drawer operations (opening and closing). On average, SpatialVLA achieves the highest overall visual matching and variant aggregation performance with a signifi...