pith. sign in

arxiv: 2406.20095 · v3 · pith:XVJN3TB2new · submitted 2024-06-28 · 💻 cs.RO · cs.AI· cs.CL· cs.CV· cs.LG

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

classification 💻 cs.RO cs.AIcs.CLcs.CVcs.LG
keywords llaramodelsroboticactiondatasetslanguagepretrainedrobot
0
0 comments X
read the original abstract

Vision Language Models (VLMs) have recently been leveraged to generate robotic actions, forming Vision-Language-Action (VLA) models. However, directly adapting a pretrained VLM for robotic control remains challenging, particularly when constrained by a limited number of robot demonstrations. In this work, we introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations and enables an efficient transfer of a pretrained VLM into a powerful VLA, motivated by the success of visual instruction tuning in Computer Vision. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets, aligning robotic actions with image pixel coordinates. Further, we enhance this dataset in a self-supervised manner by defining six auxiliary tasks, without requiring any additional action annotations. We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control. Through experiments across multiple simulated and real-world tasks, we demonstrate that LLaRA achieves state-of-the-art performance while preserving the generalization capabilities of large language models. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

    cs.RO 2026-02 unverdicted novelty 7.0

    UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.

  2. PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation

    cs.RO 2026-01 unverdicted novelty 6.0

    PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.

  3. GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data

    cs.RO 2025-05 unverdicted novelty 6.0

    GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.

  4. $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    cs.LG 2025-04 unverdicted novelty 6.0

    π_{0.5} is a VLA model that achieves long-horizon dexterous manipulation in entirely new homes through co-training on heterogeneous tasks and multi-source data including web and semantic predictions.

  5. TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

    cs.RO 2024-12 conditional novelty 6.0

    Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.

  6. Position: Vision-Language-Action Models Cannot Be Verified to Perform Physical Reasoning

    cs.RO 2026-06 conditional novelty 5.0

    VLA benchmark success rates cannot distinguish semantic generalization from physical reasoning due to an identifiability gap in current evaluation protocols.

  7. ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

    cs.CV 2025-07 unverdicted novelty 5.0

    ThinkAct introduces reinforced visual latent planning in a dual VLA system to enable better long-horizon reasoning and adaptation for embodied tasks.