GEM: Generative Supervision Helps Embodied Intelligence

Bangguo Li; Diankun Wu; Fangfu Liu; Han Hu; Junliang Ye; Jun Zhu; Ruowen Zhao; Xumin Yu; Yinan Liang; Yongming Rao

arxiv: 2605.28548 · v1 · pith:UB2PU43Anew · submitted 2026-05-27 · 💻 cs.CV

Ruowen Zhao, Bangguo Li, Zuyan Liu, Yinan Liang, Junliang Ye, Fangfu Liu, Diankun Wu, Zhengyi Wang, Xumin Yu, Yongming Rao, Han Hu, Jun Zhu

keywords embodied · model · depth · environments · execution · generative · intelligence · models

Embodied Vision-Language Models (VLMs) have demonstrated impressive performance and generalization in robotics, particularly within Vision-Language-Action frameworks. However, a significant gap remains between the high-level semantic focus of standard text-guided pre-training paradigms and the low-level spatial and physical knowledge critical for execution in embodied environments. In this paper, we introduce GEM, a Generative-supervised Embodied vision-language Model designed to bridge this divide. We propose integrating a depth map generation task directly into the VLM pre-training phase. By training this generative objective jointly with the main model, we observe substantial improvements in embodied intelligence, significantly enhancing both semantic understanding and physical operation capabilities. To support this paradigm, we curate and release GEM-4M, a comprehensive large-scale dataset featuring a mixture of grounding, reasoning, and planning data paired with high-quality depth supervision. Extensive experiments demonstrate that GEM achieves state-of-the-art results across diverse embodied benchmarks. Furthermore, our deployed action model, GEM-VLA, exhibits vastly superior task execution abilities in both simulation environments and real-world evaluations. Code, models, and datasets are available at https://zhaorw02.github.io/GEM/
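
The joint training described in the abstract can be pictured as a weighted sum of a text objective and a depth-generation objective. The following is only a minimal sketch under stated assumptions: the abstract does not specify the loss functions, so the next-token cross-entropy, the L1 depth error, the `depth_weight` value, and all function names here are hypothetical illustrations rather than the authors' implementation.

```python
import math

def text_cross_entropy(logits, target_ids):
    """Mean next-token cross-entropy (assumed text objective).

    logits: list of per-position score lists; target_ids: list of token ids.
    """
    total = 0.0
    for scores, target in zip(logits, target_ids):
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
    return total / len(target_ids)

def depth_l1(pred_depth, ref_depth):
    """Mean absolute error between two depth maps given as 2D lists
    (assumed depth objective; the paper may use a different one)."""
    diffs = [abs(p - r)
             for pred_row, ref_row in zip(pred_depth, ref_depth)
             for p, r in zip(pred_row, ref_row)]
    return sum(diffs) / len(diffs)

def joint_loss(logits, target_ids, pred_depth, ref_depth, depth_weight=0.5):
    """Text loss plus weighted depth loss; depth_weight is a hypothetical
    hyperparameter, not a value taken from the paper."""
    text_term = text_cross_entropy(logits, target_ids)
    depth_term = depth_l1(pred_depth, ref_depth)
    return text_term + depth_weight * depth_term

if __name__ == "__main__":
    # Toy inputs: one token position with two candidate tokens, a 1x2 depth map.
    loss = joint_loss([[0.0, 0.0]], [0], [[1.0, 2.0]], [[1.5, 2.0]])
    print(f"joint loss: {loss:.4f}")
```

In a sketch like this, both terms are computed on the same training example, so the depth term acts as an auxiliary supervision signal alongside the text objective; how GEM actually balances or schedules the two is not stated in the abstract.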

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models
cs.RO 2026-06 unverdicted novelty 7.0

LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoi...

LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models
cs.RO 2026-06 unverdicted novelty 6.0

Introduces LIBERO-Safety benchmark with parametric scenario generation and 19,664 collision-free demonstrations, then evaluates VLA models to reveal a generalization-safety tension.