Distillation from visual foundation models to lidar enables frame-wise indoor semantic segmentation without manual annotations, achieving up to 56% mIoU on pseudo labels and 36% on real labels.
4 Pith papers cite this work.
fields: cs.CV (4)
years: 2026 (4)
verdicts: UNVERDICTED (4)
roles: background (1)
polarities: background (1)
representative citing papers
Cross4D-JEPA uses dense projection-based cross-modal correspondence to distill features from DINOv2 or V-JEPA 2 into a 4D point encoder, outperforming intra-modal and global cross-modal baselines on four benchmarks while improving label efficiency.
HGC-Det applies hyperbolic geometry to constrain cross-modal distillation between images and point clouds, with added semantic-guided voxel optimization and feature aggregation, yielding improved accuracy-efficiency trade-offs on SUN RGB-D, ARKitScenes, KITTI, and nuScenes.
The paper offers a taxonomy of 2D-to-3D adaptation strategies divided into data-centric projection, architecture-centric 3D networks, and hybrid methods that combine both.
citing papers explorer
- Feasibility of Indoor Frame-Wise Lidar Semantic Segmentation via Distillation from Visual Foundation Model
  Distillation from visual foundation models to lidar enables frame-wise indoor semantic segmentation without manual annotations, achieving up to 56% mIoU on pseudo labels and 36% on real labels.
- Cross4D-JEPA: Dense Cross-modal Correspondence Distillation for 4D Point Cloud Representation Learning
  Cross4D-JEPA uses dense projection-based cross-modal correspondence to distill features from DINOv2 or V-JEPA 2 into a 4D point encoder, outperforming intra-modal and global cross-modal baselines on four benchmarks while improving label efficiency.
- Hyperbolic Distillation: Geometry-Guided Cross-Modal Transfer for Robust 3D Object Detection
  HGC-Det applies hyperbolic geometry to constrain cross-modal distillation between images and point clouds, with added semantic-guided voxel optimization and feature aggregation, yielding improved accuracy-efficiency trade-offs on SUN RGB-D, ARKitScenes, KITTI, and nuScenes.
- Bridging the Dimensionality Gap: A Taxonomy and Survey of 2D Vision Model Adaptation for 3D Analysis
  The paper offers a taxonomy of 2D-to-3D adaptation strategies divided into data-centric projection, architecture-centric 3D networks, and hybrid methods that combine both.