On Pre-Training for Visuo-Motor Control: Revisiting a Learning-from-Scratch Baseline

Aravind Rajeswaran; Hao Su; Huazhe Xu; Nicklas Hansen; Tongzhou Mu; Xiaolong Wang; Yanjie Ze; Zhecheng Yuan

arxiv: 2212.05749 · v2 · pith:U2D2SCYGnew · submitted 2022-12-12 · 💻 cs.LG · cs.CV · cs.RO

Nicklas Hansen, Zhecheng Yuan, Yanjie Ze, Tongzhou Mu, Aravind Rajeswaran, Hao Su, Huazhe Xu, Xiaolong Wang

classification 💻 cs.LG · cs.CV · cs.RO

keywords baseline, control, pre-training, visuo-motor, datasets, learning-from-scratch, simple, accurately

In this paper, we examine the effectiveness of pre-training for visuo-motor control tasks. We revisit a simple Learning-from-Scratch (LfS) baseline that incorporates data augmentation and a shallow ConvNet, and find that this baseline is surprisingly competitive with recent approaches (PVR, MVP, R3M) that leverage frozen visual representations trained on large-scale vision datasets -- across a variety of algorithms, task domains, and metrics in simulation and on a real robot. Our results demonstrate that these methods are hindered by a significant domain gap between the pre-training datasets and current benchmarks for visuo-motor control, which is alleviated by finetuning. Based on our findings, we provide recommendations for future research in pre-training for control and hope that our simple yet strong baseline will aid in accurately benchmarking progress in this area.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SFHand: Learning Embodied Manipulation by Streaming Egocentric 3D Hand Forecasting
cs.CV 2025-11 unverdicted novelty 7.0

SFHand presents the first streaming language-guided autoregressive framework for 3D hand forecasting, achieving up to 35.8% gains over prior methods and 13.4% better downstream embodied task performance.

IGen: Scalable Data Generation for Robot Learning from Open-World Images
cs.RO 2025-12 unverdicted novelty 6.0

IGen generates realistic visuomotor training data including actions and temporally coherent visuals from unstructured open-world images via 3D reconstruction and VLM reasoning.

Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation
cs.RO 2024-09 unverdicted novelty 6.0

Gen2Act enables generalizable robot manipulation for unseen objects and novel motions by using zero-shot human video generation from web data to condition a policy trained on an order of magnitude less robot interaction data.

Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels
cs.RO 2026-02 unverdicted novelty 5.0

An end-to-end policy learns robust humanoid locomotion directly from noisy depth images via high-fidelity sensor simulation, vision-aware distillation from privileged maps, and terrain-specific multi-critic reward shaping.