arxiv: 2501.15830 · v5 · submitted 2025-01-27 · 💻 cs.RO · cs.AI

Recognition: 2 theorem links


SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, Jiayuan Gu, Bin Zhao, Dong Wang, Xuelong Li


Pith reviewed 2026-05-12 06:06 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords SpatialVLA · visual-language-action · spatial representation · robot manipulation · 3D position encoding · action grids · generalization · zero-shot


The pith

SpatialVLA uses 3D position encoding and adaptive action grids to build generalist robot manipulation policies with strong generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that spatial understanding is the key to effective robot manipulation and develops SpatialVLA, a visual-language-action model enhanced with two spatial components. Ego3D Position Encoding adds 3D information to visual observations, while Adaptive Action Grids discretize actions adaptively to learn transferable spatial knowledge. Pre-trained on 1.1 million real-world episodes, the model is applied zero-shot to multiple tasks and shows advantages in complex trajectory inference and multi-task generalization, both in simulation and on real robots. It also supports effective fine-tuning for new setups through re-discretization of the action grids. A sympathetic reader would care because this points to a path toward more adaptable robot foundation models that require less customization per environment.

Core claim

By introducing Ego3D Position Encoding to inject 3D information into the input observations and proposing Adaptive Action Grids to represent spatial robot movement actions with adaptive discretized action grids, SpatialVLA facilitates learning generalizable and transferable spatial action knowledge for cross-robot control. Pre-trained on top of a vision-language model with 1.1 million real-world robot episodes, it learns a generalist manipulation policy that is directly applied in a zero-shot manner, with superior results showing advantages in inferring complex robot motion trajectories and strong in-domain multi-task generalization ability. The Adaptive Action Grids further offer a new and effective way to fine-tune the pre-trained SpatialVLA model for new simulation and real-world setups.

What carries the argument

The Ego3D Position Encoding and Adaptive Action Grids, which together provide spatial awareness to the visual-language-action model by adding 3D positional data to inputs and using adaptive grids for action representation to support cross-robot transfer.
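
To make the "adding 3D positional data to inputs" idea concrete, here is a minimal sketch in plain Python. The summary above does not say how Ego3D Position Encoding is computed, so everything here is an assumption for illustration: the pinhole back-projection from a depth value, the sinusoidal features, the element-wise addition to a visual token, and the names `backproject`, `sinusoidal_encoding`, and `inject`.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with metric depth
    into the camera (egocentric) frame. Intrinsics are assumed known."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def sinusoidal_encoding(point, num_freqs=4):
    """Encode a 3D point with sin/cos pairs at frequencies 1, 2, 4, ...
    Output length is 3 * num_freqs * 2."""
    features = []
    for coord in point:
        for k in range(num_freqs):
            freq = 2.0 ** k
            features.append(math.sin(freq * coord))
            features.append(math.cos(freq * coord))
    return features


def inject(token, encoding):
    """Add the position features to a visual token of the same width.
    A real model would likely project the encoding first (assumption)."""
    if len(token) != len(encoding):
        raise ValueError("token and encoding must have the same width")
    return [t + e for t, e in zip(token, encoding)]


# Hypothetical usage: one patch centre, one depth value, one 24-d token.
point = backproject(u=400, v=260, depth=1.5, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
token = inject([0.0] * 24, sinusoidal_encoding(point))
```

The point of the sketch is only that the 3D signal enters at the input, before the language model sees the visual tokens; which depth source and which projection the paper uses is not stated in this summary.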

If this is right

Direct zero-shot application to numerous tasks after pre-training on 1.1M episodes.
Advantage in inferring complex robot motion trajectories in simulation and real-world.
Strong in-domain multi-task generalization across multiple robot environments.
Effective fine-tuning for new simulation and real-world setups via re-discretized action grids.
Exceptional in-distribution generalization and out-of-distribution adaptation capability.
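
The fine-tuning item above turns on "re-discretized action grids". One plausible reading, sketched below under loudly stated assumptions, is that the bin layout is refitted to the new robot's action statistics and the new bins inherit their embeddings from the pre-trained ones by interpolation. The Gaussian fit, the equal-probability bins, the linear interpolation, and the names `bin_centers`, `interpolate_embeddings`, and `rediscretize` are all hypothetical; the summary does not confirm any of them.

```python
from statistics import NormalDist, mean, stdev


def bin_centers(dist, num_bins):
    """Median point of each equal-probability bin under `dist`."""
    return [dist.inv_cdf((i + 0.5) / num_bins) for i in range(num_bins)]


def interpolate_embeddings(old_centers, old_embeddings, new_centers):
    """Initialise each new bin's embedding by linear interpolation between
    the two neighbouring old bins; clamp outside the old range."""
    result = []
    for c in new_centers:
        if c <= old_centers[0]:
            result.append(list(old_embeddings[0]))
            continue
        if c >= old_centers[-1]:
            result.append(list(old_embeddings[-1]))
            continue
        # Largest j with old_centers[j] <= c; guaranteed to exist here.
        j = max(i for i in range(len(old_centers) - 1) if old_centers[i] <= c)
        t = (c - old_centers[j]) / (old_centers[j + 1] - old_centers[j])
        result.append([(1 - t) * a + t * b
                       for a, b in zip(old_embeddings[j], old_embeddings[j + 1])])
    return result


def rediscretize(new_actions, old_dist, old_embeddings):
    """Refit the grid on one action dimension of the new robot and carry
    the pre-trained embeddings over to the new bin centres."""
    num_bins = len(old_embeddings)
    new_dist = NormalDist(mean(new_actions), stdev(new_actions))
    old_c = bin_centers(old_dist, num_bins)
    new_c = bin_centers(new_dist, num_bins)
    return new_dist, interpolate_embeddings(old_c, old_embeddings, new_c)
```

Under this reading the spatial prior survives the change of robot because each new token starts from the embeddings of the nearest pre-trained bins rather than from random initialisation.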

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar spatial injection techniques could be applied to other foundation models in robotics to improve their spatial reasoning without full retraining.
The adaptive discretization might allow for easier integration of new robot hardware by preserving learned spatial priors.
Extending this to longer-horizon tasks or environments with dynamic obstacles could test the limits of the spatial representations.
Combining the model with online adaptation mechanisms might further enhance real-world deployment reliability.

Load-bearing premise

That the reported performance gains stem mainly from the Ego3D Position Encoding and Adaptive Action Grids rather than from the choice of vision-language model base or the volume of pre-training data alone.

What would settle it

Training an identical model without the Ego3D encoding or with non-adaptive fixed action grids on the same 1.1M episodes and evaluating whether the generalization metrics in simulation and real-world tasks match or fall short of the SpatialVLA results.
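
The test described above amounts to a 2x2 ablation with data, backbone, and training procedure held fixed. A minimal sketch of that run matrix follows; the class `AblationRun`, the variant labels, and the helper names are hypothetical, and no results are implied.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AblationRun:
    position_encoding: str            # "ego3d" or "standard_2d" (labels are assumptions)
    action_grid: str                  # "adaptive" or "fixed_uniform"
    pretrain_episodes: int = 1_100_000  # held fixed across all runs
    backbone: str = "same_vlm"          # held fixed across all runs


def ablation_matrix():
    """All four combinations of the two spatial components."""
    encodings = ["ego3d", "standard_2d"]
    grids = ["adaptive", "fixed_uniform"]
    return [AblationRun(e, g) for e, g in product(encodings, grids)]


def controlled_pairs(runs):
    """Pairs of runs that differ in exactly one spatial component,
    which are the only comparisons that isolate a single factor."""
    pairs = []
    for i, a in enumerate(runs):
        for b in runs[i + 1:]:
            differing = ((a.position_encoding != b.position_encoding)
                         + (a.action_grid != b.action_grid))
            if differing == 1:
                pairs.append((a, b))
    return pairs
```

Comparing only within `controlled_pairs(ablation_matrix())` keeps pre-training scale out of the attribution, which is the confound the load-bearing premise worries about.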

Original abstract

In this paper, we claim that spatial understanding is the keypoint in robot manipulation, and propose SpatialVLA to explore effective spatial representations for the robot foundation model. Specifically, we introduce Ego3D Position Encoding to inject 3D information into the input observations of the visual-language-action model, and propose Adaptive Action Grids to represent spatial robot movement actions with adaptive discretized action grids, facilitating learning generalizable and transferrable spatial action knowledge for cross-robot control. SpatialVLA is first pre-trained on top of a vision-language model with 1.1 Million real-world robot episodes, to learn a generalist manipulation policy across multiple robot environments and tasks. After pre-training, SpatialVLA is directly applied to perform numerous tasks in a zero-shot manner. The superior results in both simulation and real-world robots demonstrate its advantage of inferring complex robot motion trajectories and its strong in-domain multi-task generalization ability. We further show the proposed Adaptive Action Grids offer a new and effective way to fine-tune the pre-trained SpatialVLA model for new simulation and real-world setups, where the pre-learned action grids are re-discretized to capture robot-specific spatial action movements of new setups. The superior results from extensive evaluations demonstrate the exceptional in-distribution generalization and out-of-distribution adaptation capability, highlighting the crucial benefit of the proposed spatial-aware representations for generalist robot policy learning. All the details and codes will be open-sourced.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SpatialVLA adds Ego3D encoding and adaptive grids to a large-scale VLA but the abstract gives no numbers or isolating tests, so the attribution to those pieces stays unproven.


The main point is that this paper puts forward two concrete spatial mechanisms for visual-language-action robot policies and pre-trains them on 1.1 million episodes. Ego3D Position Encoding injects 3D information into the visual inputs, and Adaptive Action Grids discretize actions in a way that can be re-tuned for new robot setups. That combination is presented as the route to better trajectory inference and cross-robot transfer after pre-training on a vision-language backbone.

Referee Report

2 major / 2 minor

Summary. The paper introduces SpatialVLA, a visual-language-action model for robot manipulation that augments a VLM backbone with Ego3D Position Encoding (to inject 3D spatial information into visual observations) and Adaptive Action Grids (to discretize actions adaptively for cross-robot transfer). The model is pre-trained on 1.1 million real-world robot episodes, then evaluated zero-shot on simulation and real-world tasks and further fine-tuned via re-discretization of the action grids for new setups. The central claim is that these spatial representations enable superior trajectory inference, strong in-domain multi-task generalization, and effective out-of-distribution adaptation compared to prior VLA approaches.

Significance. If the performance claims are supported by rigorous quantitative evidence and isolating ablations, the work would meaningfully advance generalist robot policies by demonstrating that explicit spatial encodings and adaptive action discretization can improve generalization across robots and tasks beyond scale alone. The large-scale pre-training regime and commitment to open-sourcing code and models are positive contributions that could facilitate follow-on research.

major comments (2)

[§4 (Experiments)] §4 (Experiments) and associated tables/figures: the central attribution of superior zero-shot and fine-tuning results to Ego3D Position Encoding and Adaptive Action Grids is not yet load-bearing because the manuscript reports only end-to-end comparisons against external baselines. No controlled ablations are described that hold the 1.1M-episode pre-training data, VLM backbone, and training procedure fixed while swapping in standard positional encodings or fixed (non-adaptive) action grids. Without these, it remains possible that gains derive primarily from pre-training scale rather than the proposed spatial components.
[Abstract and §4] Abstract and §4: the repeated claim of 'superior results' and 'strong in-domain multi-task generalization' is presented without early quantitative anchors (specific success rates, baselines, error bars, or statistical significance). This makes the strength of the empirical support difficult to assess from the high-level summary and requires the reader to locate the precise metrics and comparisons later in the text.
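
The second major comment asks for success rates with error bars. For binary task outcomes over a modest number of trials, one standard choice is the Wilson score interval; the sketch below is a generic illustration, not the paper's evaluation protocol, and the function name `wilson_interval` is an assumption.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, trials, confidence=0.95):
    """Wilson score interval for a binomial success rate.
    Behaves better than the normal approximation for few trials
    or rates near 0 or 1."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Hypothetical usage: 8 successes in 10 rollouts of one task.
low, high = wilson_interval(8, 10)
```

With only ten rollouts the interval spans roughly 0.49 to 0.94, which shows why a bare "superior results" claim is hard to weigh without trial counts.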

minor comments (2)

[§3] Notation for Ego3D Position Encoding and the discretization parameters of Adaptive Action Grids should be introduced with explicit equations or pseudocode in §3 to allow precise reproduction.
[Conclusion] The manuscript states that 'all details and codes will be open-sourced' but does not specify the exact release timeline or repository; adding this information would strengthen reproducibility claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments and the opportunity to clarify and strengthen our manuscript. We address each major comment below.


Referee: [§4 (Experiments)] §4 (Experiments) and associated tables/figures: the central attribution of superior zero-shot and fine-tuning results to Ego3D Position Encoding and Adaptive Action Grids is not yet load-bearing because the manuscript reports only end-to-end comparisons against external baselines. No controlled ablations are described that hold the 1.1M-episode pre-training data, VLM backbone, and training procedure fixed while swapping in standard positional encodings or fixed (non-adaptive) action grids. Without these, it remains possible that gains derive primarily from pre-training scale rather than the proposed spatial components.

Authors: We agree that controlled ablations would provide stronger evidence for the specific contributions of our proposed components. In the revised manuscript, we will include additional experiments that fix the 1.1M-episode pre-training data, VLM backbone, and training procedure, and compare variants with standard positional encodings versus Ego3D Position Encoding, as well as fixed action grids versus Adaptive Action Grids. These ablations will help isolate the impact of the spatial representations.
revision: yes

Referee: [Abstract and §4] Abstract and §4: the repeated claim of 'superior results' and 'strong in-domain multi-task generalization' is presented without early quantitative anchors (specific success rates, baselines, error bars, or statistical significance). This makes the strength of the empirical support difficult to assess from the high-level summary and requires the reader to locate the precise metrics and comparisons later in the text.

Authors: We acknowledge that early quantitative anchors would enhance the clarity of our claims. We will revise the abstract and the opening of §4 to include specific success rates from our evaluations, comparisons to key baselines, and references to error bars and statistical details provided in the tables and figures. This will allow readers to immediately gauge the empirical support without needing to search further in the paper.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pre-training and evaluation with no self-referential derivations

full rationale

The paper proposes two spatial components (Ego3D Position Encoding and Adaptive Action Grids), pre-trains a VLA model on 1.1M real-world episodes, then reports zero-shot and fine-tuning results on simulation and real robots. No equations, predictions, or first-principles derivations are presented that reduce to fitted parameters or prior self-citations by construction. All claims are framed as measured experimental outcomes rather than analytic necessities. No load-bearing self-citations, uniqueness theorems, or ansatzes appear in the abstract or described method. This is a standard empirical robotics paper whose central claims rest on external benchmarks and data, not internal tautologies.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 2 invented entities

The central claims rest on the assumption that adding explicit 3D encodings and adaptive discretization will improve spatial reasoning in transformer-based VLA models trained at scale, plus standard assumptions about large-scale pre-training leading to generalization.

free parameters (1)

Action grid discretization resolution
Adaptive grids require choices of bin sizes or resolution that are likely tuned per robot or task.
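
To show where this free parameter enters, here is a minimal sketch of one way an adaptive grid could be built for a single action dimension: fit a Gaussian to the observed actions and cut it into `num_bins` equal-probability intervals. The Gaussian fit, the equal-probability rule, and the names `fit_grid`, `encode`, and `decode` are assumptions for illustration; the summary does not specify the paper's construction.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, stdev


def fit_grid(actions, num_bins):
    """Fit a Gaussian to one action dimension and return
    (interior bin edges, bin centres) for equal-probability bins.
    `num_bins` is the resolution choice flagged as a free parameter."""
    dist = NormalDist(mean(actions), stdev(actions))
    edges = [dist.inv_cdf(i / num_bins) for i in range(1, num_bins)]
    centers = [dist.inv_cdf((i + 0.5) / num_bins) for i in range(num_bins)]
    return edges, centers


def encode(value, edges):
    """Map a continuous action value to a discrete token index."""
    return bisect_right(edges, value)


def decode(token, centers):
    """Map a token index back to a representative action value."""
    return centers[token]


# Hypothetical usage on a toy action dimension.
edges, centers = fit_grid([-1.0, -0.5, 0.0, 0.5, 1.0], num_bins=4)
token = encode(0.7, edges)
approx = decode(token, centers)
```

Because bins are narrow where actions are dense and wide in the tails, the same `num_bins` yields different metric resolution on different robots, which is why the setting is likely tuned per setup.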

axioms (2)

domain assumption: Vision-language models can be extended with additional position encodings to incorporate 3D spatial information effectively
Invoked when introducing Ego3D Position Encoding as a direct injection into input observations.
domain assumption: Discretized action grids can capture transferable spatial movement knowledge across robots
Central to the claim that re-discretization enables fine-tuning for new setups.

invented entities (2)

Ego3D Position Encoding (no independent evidence)
purpose: Inject 3D information into the input observations of the visual-language-action model
New encoding scheme proposed to address spatial understanding limitations.
Adaptive Action Grids (no independent evidence)
purpose: Represent spatial robot movement actions with adaptive discretized action grids
New representation for actions to facilitate generalizable and transferable spatial knowledge.

pith-pipeline@v0.9.0 · 5590 in / 1643 out tokens · 110445 ms · 2026-05-12T06:06:01.441357+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.DimensionForcing D3_admits_circle_linking echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

we introduce Ego3D Position Encoding to inject 3D information into the input observations of the visual-language-action model, and propose Adaptive Action Grids to represent spatial robot movement actions with adaptive discretized action grids

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 43 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 7.0

ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 7.0

VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 7.0

GuardVLA embeds a stealthy backdoor watermark in VLAs via secret messages in visual data and uses a swap-and-detect mechanism for post-release ownership verification that preserves task performance.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 conditional novelty 7.0

Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
cs.RO 2026-05 unverdicted novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields
cs.CV 2026-05 unverdicted novelty 7.0

EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.
MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO 2026-05 unverdicted novelty 7.0

MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.
Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation
cs.AI 2026-05 unverdicted novelty 7.0

A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.
Being-H0.7: A Latent World-Action Model from Egocentric Videos
cs.RO 2026-04 unverdicted novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis
cs.RO 2026-04 unverdicted novelty 7.0

VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.
Why MLLMs Struggle to Determine Object Orientations
cs.CV 2026-04 accept novelty 7.0

Orientation information is recoverable from MLLM visual encoder embeddings via linear regression, contradicting the hypothesis that failures originate in the encoders.
DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching
cs.RO 2026-03 conditional novelty 7.0

DFM-VLA uses discrete flow matching to iteratively refine action tokens in VLA models, outperforming autoregressive and diffusion baselines with 4.44 average success length on CALVIN and 95.7% success on LIBERO.
VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models
cs.RO 2026-03 unverdicted novelty 7.0

VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.
FrameSkip: Learning from Fewer but More Informative Frames in VLA Training
cs.RO 2026-05 unverdicted novelty 6.0

FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.
See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model
cs.RO 2026-05 unverdicted novelty 6.0

GridS reduces visual tokens in VLA models to under 10% of the original count via task-aware differentiable resampling, delivering 76% lower FLOPs with no drop in task success rate on benchmarks and real robots.
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
VISOR: A Vision-Language Model-based Test Oracle for Testing Robot
cs.SE 2026-05 unverdicted novelty 6.0

VISOR applies VLMs to automate robot test oracles for correctness and quality assessment while reporting uncertainty, with evaluation on GPT and Gemini showing trade-offs in precision and recall but poor uncertainty c...
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 unverdicted novelty 6.0

Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 unverdicted novelty 6.0

Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
cs.CV 2026-05 unverdicted novelty 6.0

ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.
TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation
cs.CV 2026-05 unverdicted novelty 6.0

TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 6.0

ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO 2026-05 unverdicted novelty 6.0

MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
cs.RO 2026-04 unverdicted novelty 6.0

LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
cs.RO 2026-04 unverdicted novelty 6.0

LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations
cs.AI 2026-04 unverdicted novelty 6.0

PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
cs.RO 2026-04 unverdicted novelty 6.0

X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
cs.RO 2026-04 unverdicted novelty 6.0

X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...
Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.
Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models
cs.CV 2026-04 unverdicted novelty 6.0

PDF improves VLA success rates on LIBERO and Atari by applying test-time perturbation learning with delayed feedback to correct trajectory overfitting and overconfidence.
Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 6.0

State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks...
ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.
Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
cs.RO 2026-04 unverdicted novelty 6.0

Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
Adaptive Action Chunking at Inference-time for Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 6.0

Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
cs.RO 2026-04 conditional novelty 6.0

MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction
cs.RO 2026-05 unverdicted novelty 5.0

X-Imitator is a bidirectional action-pose interaction framework for spatial-aware imitation learning that outperforms vanilla policies and explicit pose guidance on 24 simulated and 3 real-world robotic tasks.
Gated Memory Policy
cs.RO 2026-04 unverdicted novelty 5.0

GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.
R3D: Revisiting 3D Policy Learning
cs.CV 2026-04 unverdicted novelty 5.0

A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.
World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems
cs.RO 2026-04 unverdicted novelty 5.0

The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.
From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
cs.RO 2026-04 accept novelty 5.0

A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.
Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance
cs.RO 2026-03 unverdicted novelty 5.0

Parameter differences from two training runs on a small task set are treated as auxiliary capability vectors that are merged into a pretrained VLA model, yielding auxiliary-task gains at the cost of ordinary supervise...
MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
cs.CV 2025-07 unverdicted novelty 5.0

MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.
JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy
cs.RO 2026-04 unverdicted novelty 4.0

JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · cited by 37 Pith papers · 12 internal anchors

[1]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Proceedings of the Conference on Neural Information Processing System (NeurIPS) , 2022

work page 2022
[2]

Hydra: Hybrid robot actions for imitation learning

Suneel Belkhale, Yuchen Cui, and Dorsa Sadigh. Hydra: Hybrid robot actions for imitation learning. In Proceed- ings of the Conference on Robot Learning (CoRL) , 2023

work page 2023
[3]

Roboagent: Generalization and efficiency in robot manip- ulation via semantic augmentations and action chunking

Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Ab- hinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manip- ulation via semantic augmentations and action chunking. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) , 2024

work page 2024
[4]

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Pe- ter Wonka, and Matthias M ¨uller. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023

work page internal anchor Pith review arXiv 2023
[5]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision- language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818 , 2023

[8]

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024

[9]

Berkeley UR5 demonstration dataset

Lawrence Yunliang Chen, Simeon Adebola, and Ken Goldberg. Berkeley UR5 demonstration dataset. https://sites.google.com/view/berkeley-ur5/home

[10]

Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning

Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

[11]

Pali-x: On scaling up a multilingual vision and language model

Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

[12]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023

[13]

Open x-embodiment: Robotic learning datasets and rt-x models

Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024

[14]

From play to policy: Conditional behavior generation from uncurated robot data

Zichen Jeff Cui, Yibin Wang, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. From play to policy: Conditional behavior generation from uncurated robot data. In Proceedings of International Conference on Learning Representations (ICLR), 2023

[15]

Shivin Dass, Jullian Yapeter, Jesse Zhang, Jiahui Zhang, Karl Pertsch, Stefanos Nikolaidis, and Joseph J. Lim. Clvr jaco play dataset, 2023. URL https://github.com/clvrai/clvr_jaco_play_dataset

[16]

Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation

Ria Doshi, Homer Walke, Oier Mees, Sudeep Dasari, and Sergey Levine. Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation. In Proceedings of the Conference on Robot Learning (CoRL), 2024

[17]

Bridge data: Boosting generalization of robotic skills with cross-domain datasets

Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. In Proceedings of Robotics: Science and Systems (RSS), 2022

[18]

Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot

Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. In RSS 2023 Workshop on Learning for Task and Motion Planning, 2023

[19]

Scene-llm: Extending language model for 3d visual understanding and reasoning

Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wenhan Xiong. Scene-llm: Extending language model for 3d visual understanding and reasoning. arXiv preprint arXiv:2403.11401, 2024

[20]

The organization of learning

Charles R Gallistel. The organization of learning. The MIT Press, 1990

[21]

Polytask: Learning unified policies through behavior distillation

Siddhant Haldar and Lerrel Pinto. Polytask: Learning unified policies through behavior distillation. arXiv preprint arXiv:2310.08573, 2023

[22]

Baku: An efficient transformer for multi-task policy learning

Siddhant Haldar, Zhuoran Peng, and Lerrel Pinto. Baku: An efficient transformer for multi-task policy learning. In Proceedings of the Conference on Neural Information Processing System (NeurIPS), 2024

[23]

Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation

Minho Heo, Youngwoon Lee, Doohyun Lee, and Joseph J Lim. Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation. In Proceedings of Robotics: Science and Systems (RSS), 2023

[24]

3d-llm: Injecting the 3d world into large language models

Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. In Proceedings of the Conference on Neural Information Processing System (NeurIPS), 2023

[25]

An embodied generalist agent in 3d world

Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. In Proceedings of the International Conference on Machine Learning (ICML), 2024

[26]

Bc-z: Zero-shot task generalization with robotic imitation learning

Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Proceedings of the Conference on Robot Learning (CoRL), 2022

[27]

Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation

Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In Proceedings of the Conference on Robot Learning (CoRL), 2018

[28]

Prismatic vlms: Investigating the design space of visually-conditioned language models

Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. In Proceedings of the International Conference on Machine Learning (ICML), 2024

[29]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024

[30]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

[31]

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024

[33]

Towards generalist robot policies: What matters in building vision-language-action models

Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, and Huaping Liu. Towards generalist robot policies: What matters in building vision-language-action models. arXiv preprint arXiv:2412.14058, 2024

[34]

Vision-language foundation models as effective robot imitators

Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. In Proceedings of International Conference on Learning Representations (ICLR), 2024

[35]

Evaluating real-world robot manipulation policies in simulation

Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation. In Proceedings of the Conference on Robot Learning (CoRL), 2024

[36]

LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. arXiv preprint arXiv:2306.03310, 2023

[37]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Proceedings of the Conference on Neural Information Processing System (NeurIPS), 2024

[38]

Robot learning on the job: Human-in-the-loop autonomy and learning during deployment

Huihan Liu, Soroush Nasiriany, Lance Zhang, Zhiyao Bao, and Yuke Zhu. Robot learning on the job: Human-in-the-loop autonomy and learning during deployment. In Proceedings of Robotics: Science and Systems (RSS), 2023

[39]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024

[40]

Visuo-spatial working memory

Robert H Logie. Visuo-spatial working memory. Psychology Press, 2014

[41]

Multi-stage cable routing through hierarchical imitation learning

Jianlan Luo, Charles Xu, Xinyang Geng, Gilbert Feng, Kuan Fang, Liam Tan, Stefan Schaal, and Sergey Levine. Multi-stage cable routing through hierarchical imitation learning. IEEE Transactions on Robotics, 40:1476–1491, 2024

[42]

Fmb: a functional manipulation benchmark for generalizable robotic learning

Jianlan Luo, Charles Xu, Fangchen Liu, Liam Tan, Zipeng Lin, Jeffrey Wu, Pieter Abbeel, and Sergey Levine. Fmb: a functional manipulation benchmark for generalizable robotic learning. The International Journal of Robotics Research, 2024

[43]

Interactive language: Talking to robots in real time

Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023

[44]

Roboturk: A crowdsourcing platform for robotic skill learning through imitation

Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Proceedings of the Conference on Robot Learning (CoRL), 2018

[45]

Grounding language with visual affordances over unstructured data

Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. Grounding language with visual affordances over unstructured data. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2023

[46]

Structured world models from human videos

Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos. In Proceedings of the Conference on Robot Learning (CoRL), 2023

[47]

Learning and retrieval from prior data for skill-based imitation learning

Soroush Nasiriany, Tian Gao, Ajay Mandlekar, and Yuke Zhu. Learning and retrieval from prior data for skill-based imitation learning. In Proceedings of the Conference on Robot Learning (CoRL), 2023

[48]

Octo: An open-source generalist robot policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science an...

[49]

Actor-mimic: Deep multitask and transfer reinforcement learning

Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. In Proceedings of International Conference on Learning Representations (ICLR), 2016

[50]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023

[51]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025

[52]

Child’s Conception of Space: Selected Works vol 4

Jean Piaget. Child’s Conception of Space: Selected Works vol 4. Routledge, 2013

[53]

Livescene: Language embedding interactive radiance fields for physical scene rendering and control

Delin Qu, Qizhi Chen, Pingrui Zhang, Xianqiang Gao, Junzhe Li, Bin Zhao, Dong Wang, and Xuelong Li. Livescene: Language embedding interactive radiance fields for physical scene rendering and control. arXiv preprint arXiv:2406.16038, 2024

[54]

Shared control templates for assistive robotics

Gabriel Quere, Annette Hagengruber, Maged Iskandar, Samuel Bustamante, Daniel Leidner, Freek Stulp, and Jörn Vogel. Shared control templates for assistive robotics. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2020

[55]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), 2021

[56]

Latent plans for task-agnostic offline reinforcement learning

Erick Rosete-Beas, Oier Mees, Gabriel Kalweit, Joschka Boedecker, and Wolfram Burgard. Latent plans for task-agnostic offline reinforcement learning. In Proceedings of the Conference on Robot Learning (CoRL), 2022

[57]

Policy distillation

Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. In Proceedings of International Conference on Learning Representations (ICLR), 2016

[58]

Multi-resolution sensing for real-time control with vision-language models

Saumya Saxena, Mohit Sharma, and Oliver Kroemer. Multi-resolution sensing for real-time control with vision-language models. In Proceedings of the Conference on Robot Learning (CoRL), 2023

[59]

On bringing robots home

Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home. arXiv preprint arXiv:2311.16098, 2023

[60]

Mutex: Learning unified policies from multimodal task specifications

Rutav Shah, Roberto Martín-Martín, and Yuke Zhu. Mutex: Learning unified policies from multimodal task specifications. In Proceedings of the Conference on Robot Learning (CoRL), 2023

[61]

Perceiver-actor: A multi-task transformer for robotic manipulation

Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Proceedings of the Conference on Robot Learning (CoRL), 2022

[62]

PaliGemma 2: A Family of Versatile VLMs for Transfer

Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, et al. Paligemma 2: A family of versatile vlms for transfer. arXiv preprint arXiv:2412.03555, 2024

[63]

Cognitive maps in rats and men

Edward C Tolman. Cognitive maps in rats and men. Psychological review, 55(4):189, 1948

[64]

Bridgedata v2: A dataset for robot learning at scale

Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. In Proceedings of the Conference on Robot Learning (CoRL), 2023

[65]

Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers

Lirui Wang, Xinlei Chen, Jialiang Zhao, and Kaiming He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. In Proceedings of the Conference on Neural Information Processing System (NeurIPS), 2024

[66]

ucsd kitchens dataset

Ge Yan, Kris Wu, and Xiaolong Wang. ucsd kitchens dataset. https://github.com/geyan21/rlds_dataset_builder/tree/main/ucsd_kitchens, 2023

[67]

Thinking in space: How multimodal large language models see, remember, and recall spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. arXiv preprint arXiv:2412.14171, 2024

[68]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

[69]

3d-vla: A 3d vision-language-action generative world model

Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: A 3d vision-language-action generative world model. In Proceedings of the International Conference on Machine Learning (ICML), 2024

[70]

Universal actions for enhanced embodied foundation models

Jinliang Zheng, Jianxiong Li, Dongxiu Liu, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, and Xianyuan Zhan. Universal actions for enhanced embodied foundation models. arXiv preprint arXiv:2501.10105, 2025

[71]

Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies

Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024

[72]

Train offline, test online: A real robot learning benchmark

Gaoyue Zhou, Victoria Dean, Mohan Kumar Srirama, Aravind Rajeswaran, Jyothish Pari, Kyle Hatch, Aryan Jain, Tianhe Yu, Pieter Abbeel, Lerrel Pinto, et al. Train offline, test online: A real robot learning benchmark. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2023

[73]

Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness

Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness. arXiv preprint arXiv:2409.18125, 2024

[74]

Fanuc manipulation: A dataset for learning-based manipulation with fanuc mate 200id robot

Xinghao Zhu, Ran Tian, Chenfeng Xu, Mingxiao Huo, Wei Zhan, Masayoshi Tomizuka, and Mingyu Ding. Fanuc manipulation: A dataset for learning-based manipulation with fanuc mate 200id robot. https://sites.google.com/berkeley.edu/fanuc-manipulation, 2023

[75]

Bottom-up skill discovery from unsegmented demonstrations for long-horizon robot manipulation

Yifeng Zhu, Peter Stone, and Yuke Zhu. Bottom-up skill discovery from unsegmented demonstrations for long-horizon robot manipulation. IEEE Robotics and Automation Letters, 7(2):4126–4133, 2022

[76]

Learning generalizable manipulation policies with object-centric 3d representations

Yifeng Zhu, Zhenyu Jiang, Peter Stone, and Yuke Zhu. Learning generalizable manipulation policies with object-centric 3d representations. In Proceedings of the Conference on Robot Learning (CoRL), 2023

[77]

Viola: Imitation learning for vision-based manipulation with object proposal priors

Yifeng Zhu, Abhishek Joshi, Peter Stone, and Yuke Zhu. Viola: Imitation learning for vision-based manipulation with object proposal priors. In Proceedings of the Conference on Robot Learning (CoRL), 2023.

SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
Supplementary Material

APPENDIX

A. Dataset Mixture Details

Fig. 9 illust...

[78]

Zero-shot Robot Control Evaluation on WidowX Robot. As described in Sec. IV-A, we conducted extensive evaluations of 5 generalist robot manipulation policies across 7 zero-shot tasks, with 11 trials per task on a real-world BridgeV2 WidowX Robot. The specific task settings are: ...

[79]


Adapting to New Robot Setups on Franka Robot. As described in Sec. IV-B, we evaluated the performance of four methods (Diffusion Policy [12], Octo [48], OpenVLA [30], and SpatialVLA) across 13 real-world tasks on a Franka Emika Panda robot, with 11 trials per task. While Diffusion Policy was trained from scratch, Octo, OpenVLA, and SpatialVLA were fine-tu...

[80]


Spatial Understanding Capability Evaluation on Franka and WidowX Robot. Following Sec. IV-C, we conducted a comprehensive evaluation of spatial understanding capabilities through 3 zero-shot tasks on the BridgeData V2 WidowX Robot and 1 efficient-finetuning task on the Franka Robot. The detailed task specifications are: • Place plush toy closest to robot ...

[81]

SimplerEnv Evaluation. Tab. X presents the SimplerEnv evaluation results on the Google Robot tasks, encompassing Coke can manipulation (horizontal and vertical picking) and drawer operations (opening and closing). On average, SpatialVLA achieves the highest overall visual matching and variant aggregation performance with a signifi...

