CLAR: Learning 3D Representations for Robotic Manipulation by Fusing Masked Reconstruction with Multi-Level Contrastive Alignment

Chengyang Zhao; Dongbin Zhao; Haoran Li; He Wang; Wenbo Cui; Yuhui Chen; Zhizheng Zhang

arXiv: 2507.08262 · v2 · pith:BWMQDYSKnew · submitted 2025-07-11 · 💻 cs.RO · cs.AI · cs.CV

Wenbo Cui, Chengyang Zhao, Yuhui Chen, Haoran Li, Zhizheng Zhang, Dongbin Zhao, He Wang

classification 💻 cs.RO · cs.AI · cs.CV

keywords alignment · learning · manipulation · clar · contrastive · fine-grained · global · local

The spatial information inherent in 3D point clouds is crucial for robotic manipulation. However, existing 3D pre-training methods face a fundamental trade-off: Masked Autoencoding (MAE) excels at capturing spatial-geometric features but lacks semantics, whereas contrastive learning, while able to distill semantics from 2D foundation models, is ill-suited for the fine-grained details required for manipulation tasks. To address these challenges, we propose CLAR, a novel 3D pre-training framework that synergizes global understanding with fine-grained local alignment. Our framework unifies MAE with global cross-modal contrastive learning to integrate robust spatial awareness with rich semantic understanding. To enhance its focus on fine-grained details, at the local level, we introduce an adaptive alignment mechanism that leverages deformable attention to force precise correspondences between local 3D geometry and 2D visual features, thereby overcoming the limitations of conventional global alignment in manipulation tasks. Extensive experiments in simulation and the real world demonstrate that CLAR achieves state-of-the-art performance, significantly outperforming existing methods in visuomotor policy learning.
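As a rough illustration of how a masked-reconstruction objective can be combined with global cross-modal contrastive alignment, here is a minimal sketch in plain Python. The abstract does not specify the loss functions, so the Chamfer distance, the InfoNCE form, the temperature, and the weight `w_contrast` are all assumptions, and every function name is hypothetical. The local deformable-attention alignment is not modeled here.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys.

    Assumed form of the global contrastive term; not taken from the paper.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i against every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[i]
    return total / len(q)

def chamfer_distance(pred, target):
    """Symmetric Chamfer distance (squared Euclidean) between two point sets.

    Assumed stand-in for the masked-reconstruction loss.
    """
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def one_way(src, dst):
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(pred, target) + one_way(target, pred)

def combined_loss(pred_pts, masked_pts, emb_3d, emb_2d, w_contrast=1.0):
    """Reconstruction term plus a symmetric global 3D<->2D contrastive term.

    w_contrast is a hypothetical weighting hyperparameter.
    """
    recon = chamfer_distance(pred_pts, masked_pts)
    contrast = 0.5 * (info_nce(emb_3d, emb_2d) + info_nce(emb_2d, emb_3d))
    return recon + w_contrast * contrast

# Toy usage: two masked points reconstructed exactly, two paired embeddings.
points = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
emb_3d = [[1.0, 0.0], [0.0, 1.0]]
emb_2d = [[1.0, 0.0], [0.0, 1.0]]
loss = combined_loss(points, points, emb_3d, emb_2d)
```

In this toy case the reconstruction term is zero and the contrastive term is close to zero because each 3D embedding matches only its paired 2D embedding. Swapping the rows of `emb_2d` would make the contrastive term large.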

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

InCoM: Intent-Driven Perception and Structured Coordination for Mobile Manipulation
cs.RO 2026-02 unverdicted novelty 6.0

InCoM achieves 23-28% higher success rates in mobile manipulation tasks by inferring motion intent for adaptive perception and decoupling base-arm action generation.

The Moving Eye: Enhancing VLA Spatial Generalization via Hybrid Dynamic Data Collection
cs.RO 2026-07 unverdicted novelty 5.0

A hybrid data collection strategy with a mobile camera arm in dual-arm robots reduces shortcut learning in VLA models and improves spatial generalization to unseen poses and configurations across ACT, Diffusion, and V...