LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Cristina Mata; Jinghuan Shang; Jongwoo Park; Kanchana Ranasinghe; Kumara Kahatapitiya; Michael S. Ryoo; Mu Cai; Ryan Burgert; Xiang Li; Yong Jae Lee

arxiv: 2406.20095 · v3 · pith:XVJN3TB2new · submitted 2024-06-28 · 💻 cs.RO · cs.AI· cs.CL· cs.CV· cs.LG

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Xiang Li , Cristina Mata , Jongwoo Park , Kumara Kahatapitiya , Yoo Sung Jang , Jinghuan Shang , Kanchana Ranasinghe , Ryan Burgert

show 3 more authors

Mu Cai Yong Jae Lee Michael S. Ryoo

This is my paper

classification 💻 cs.RO cs.AIcs.CLcs.CVcs.LG

keywords llaramodelsroboticactiondatasetslanguagepretrainedrobot

0 comments

read the original abstract

Vision Language Models (VLMs) have recently been leveraged to generate robotic actions, forming Vision-Language-Action (VLA) models. However, directly adapting a pretrained VLM for robotic control remains challenging, particularly when constrained by a limited number of robot demonstrations. In this work, we introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations and enables an efficient transfer of a pretrained VLM into a powerful VLA, motivated by the success of visual instruction tuning in Computer Vision. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets, aligning robotic actions with image pixel coordinates. Further, we enhance this dataset in a self-supervised manner by defining six auxiliary tasks, without requiring any additional action annotations. We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control. Through experiments across multiple simulated and real-world tasks, we demonstrate that LLaRA achieves state-of-the-art performance while preserving the generalization capabilities of large language models. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models
cs.RO 2026-02 unverdicted novelty 7.0

UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
cs.RO 2026-01 unverdicted novelty 6.0

PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data
cs.RO 2025-05 unverdicted novelty 6.0

GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
cs.LG 2025-04 unverdicted novelty 6.0

π_{0.5} is a VLA model that achieves long-horizon dexterous manipulation in entirely new homes through co-training on heterogeneous tasks and multi-source data including web and semantic predictions.
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
cs.RO 2024-12 conditional novelty 6.0

Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.
Position: Vision-Language-Action Models Cannot Be Verified to Perform Physical Reasoning
cs.RO 2026-06 conditional novelty 5.0

VLA benchmark success rates cannot distinguish semantic generalization from physical reasoning due to an identifiability gap in current evaluation protocols.
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
cs.CV 2025-07 unverdicted novelty 5.0

ThinkAct introduces reinforced visual latent planning in a dual VLA system to enable better long-horizon reasoning and adaptation for embodied tasks.