ESAM++: Efficient Online 3D Perception on the Edge

Andrea Colaco; Ehsan Adeli; Lavisha Aggarwal; Marc Niethammer; Qin Liu; Saptarashmi Bandyopadhyay; Vikas Bahirwani

arxiv: 2605.29505 · v1 · pith:2RMSC66Mnew · submitted 2026-05-28 · 💻 cs.CV

Qin Liu, Lavisha Aggarwal, Saptarashmi Bandyopadhyay, Vikas Bahirwani, Marc Niethammer, Ehsan Adeli, Andrea Colaco

Pith reviewed 2026-06-29 08:47 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D instance segmentation · online 3D perception · edge computing · sparse feature pyramid · point cloud processing · lightweight networks · real-time robotics

The pith

ESAM++ replaces the 3D sparse UNet with a lighter Sparse Feature Pyramid Network to deliver competitive online 3D segmentation up to three times faster and with half the model size on edge devices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the main computational cost in real-time 3D scene perception for robotics and AR/VR on resource-limited hardware. It replaces the heavy 3D sparse UNet used in the prior ESAM method with a new 3D Sparse Feature Pyramid Network that extracts multi-scale geometric features from streaming point clouds at lower cost. Tests across ScanNet, ScanNet200, SceneNN, and 3RScan show the resulting ESAM++ system keeps accuracy while cutting inference time by up to 3x and shrinking the model by 2x. This shift matters because it removes the need for GPU acceleration and makes fine-grained 3D instance segmentation practical on edge hardware where privacy and power constraints rule out cloud offloading.

Core claim

ESAM++ introduces a 3D Sparse Feature Pyramid Network (SFPN) that efficiently captures multi-scale geometric features from streaming 3D point clouds, replacing the computationally dominant 3D sparse UNet of the original ESAM and thereby achieving competitive accuracy on four segmentation benchmarks with up to three times faster inference and a two times smaller model size.

What carries the argument

The 3D Sparse Feature Pyramid Network (SFPN), which extracts multi-scale geometric features from point clouds with reduced overhead in place of a full 3D sparse UNet.
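The multi-scale idea can be illustrated with a toy sketch. It follows only the outline given in the Figure 3 caption (features computed at several resolutions, brought back to the points, and concatenated) and is not the authors' implementation: fixed average pooling over voxels stands in for learned sparse convolutions and residual blocks, there is no decoder or final MLP, and the names `voxelize` and `pyramid_features` as well as the voxel sizes are assumptions made for illustration.

```python
from math import floor

def voxelize(points, feats, size):
    """Average the features of all points that fall into the same voxel."""
    sums, counts = {}, {}
    for p, f in zip(points, feats):
        key = tuple(floor(c / size) for c in p)
        acc = sums.setdefault(key, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[key] = counts.get(key, 0) + 1
    return {k: [v / counts[k] for v in acc] for k, acc in sums.items()}

def pyramid_features(points, feats, sizes=(0.1, 0.2, 0.4)):
    """Return one concatenated multi-scale feature vector per input point.

    The voxel sizes are hypothetical; coarser voxels summarise larger
    neighbourhoods, which is the 'pyramid' part of the sketch.
    """
    levels = [voxelize(points, feats, s) for s in sizes]
    fused_all = []
    for p in points:
        fused = []
        for s, level in zip(sizes, levels):
            key = tuple(floor(c / s) for c in p)
            fused.extend(level[key])  # copy the voxel feature back to the point
        fused_all.append(fused)
    return fused_all

# Three points on a line, each with a 1-D feature (made-up values).
points = [(0.05, 0.05, 0.05), (0.15, 0.05, 0.05), (0.35, 0.05, 0.05)]
feats = [[1.0], [3.0], [5.0]]
fused = pyramid_features(points, feats)
```

Each point ends up with one feature per scale: its own value at the finest level and progressively wider neighbourhood averages at the coarser levels. A learned network would replace the averaging with trainable sparse convolutions.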

If this is right

ESAM++ runs online 3D instance segmentation at interactive rates on CPUs or edge chips without GPUs.
The model size drops by a factor of two while accuracy stays competitive across four standard benchmarks.
Streaming point-cloud perception becomes feasible in privacy-sensitive or power-limited settings such as mobile robots.
The same SFPN block can be swapped into other pipelines that currently rely on 3D sparse UNets for feature extraction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar pyramid-style replacements might reduce compute in other sparse-data tasks such as 3D object detection or semantic mapping.
If the accuracy holds on new datasets, the approach could extend real-time 3D perception to consumer phones or embedded cameras.
The speed gain opens the door to higher frame-rate or higher-resolution inputs that were previously too expensive.

Load-bearing premise

The Sparse Feature Pyramid Network can pull out enough multi-scale detail from streaming point clouds to keep the same fine-grained segmentation quality that the original 3D sparse UNet provided.

What would settle it

Measure instance segmentation mIoU or AP on ScanNet using the released ESAM++ weights and check whether accuracy falls more than a few points below the original ESAM while the reported speed and size gains remain.
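A minimal sketch of that check, assuming the three quantities have already been measured for both models. The `Measurement` fields, the `compare` helper, and the 3-point tolerance for "a few points" are assumptions for illustration, and the numbers are placeholders, not results from the paper.

```python
from dataclasses import dataclass

@dataclass
class Measurement:
    ap: float          # instance segmentation AP, in points
    latency_ms: float  # per-frame inference time
    params_m: float    # model size, in millions of parameters

def compare(baseline, candidate, max_ap_drop=3.0):
    """Summarise the accuracy/efficiency trade-off of candidate vs baseline."""
    ap_drop = baseline.ap - candidate.ap
    return {
        "ap_drop": ap_drop,
        "speedup": baseline.latency_ms / candidate.latency_ms,
        "size_ratio": baseline.params_m / candidate.params_m,
        "accuracy_holds": ap_drop <= max_ap_drop,
    }

# Placeholder values for illustration only; NOT measurements from the paper.
baseline = Measurement(ap=50.0, latency_ms=100.0, params_m=8.0)
candidate = Measurement(ap=48.0, latency_ms=50.0, params_m=4.0)
report = compare(baseline, candidate)
```

Reporting the accuracy drop next to the speed and size ratios keeps the trade-off visible in one place, rather than quoting the efficiency gains alone.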

Figures

Figures reproduced from arXiv: 2605.29505 by Andrea Colaco, Ehsan Adeli, Lavisha Aggarwal, Marc Niethammer, Qin Liu, Saptarashmi Bandyopadhyay, Vikas Bahirwani.

**Figure 1.** Overview of ESAM and ESAM++. ESAM [36] (right figure) is the state-of-the-art approach for online 3D scene perception. We identify two key efficiency bottlenecks in its design: (1) the use of a frozen Visual Foundation Model (VFM) based on FastSAM [44], and (2) a point cloud encoder built upon a 3D sparse UNet. This work focuses on optimizing the latter, while improvements to the VFM component are left for…

**Figure 2.** Computational analysis of the 3D sparse UNet used in ESAM. The left diagram shows architecture details; the right chart highlights parameter and latency distribution across each layer. Top layers (e.g., Layer 0) cause high latency due to voxel density and large kernels, while bottom layers (e.g., Layer 4) dominate model size. This motivates a more balanced encoder for edge use. The memory adapter module [3…

**Figure 3.** Architecture of the proposed SFPN. SFPN is a lightweight encoder-decoder for efficient multi-scale feature extraction from 3D point clouds. The encoder downsamples features through sparse convolutions and residual blocks, while the decoder upsamples and refines them. SFPN uniquely concatenates upsampled features from all decoder stages before an MLP generates the final point-wise features. We implement thr…

**Figure 4.** Ablation study of the SFPN architecture: (a) full model, (b) without upsampled feature fusion, (c) without the feature pyramid, and (d) without skip connections. Comparison results are shown in…

**Figure 5.** Impact of noisy camera poses. We evaluate our method on the ScanNet200 dataset for online class-agnostic 3D instance segmentation. The results show that our method remains robust under camera pose noise of up to 5%. However, performance degrades significantly when the noise level increases to 20%, leading to failure cases.
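The Figure 2 observation (shallow layers dominate latency, deep layers dominate model size) follows from simple counting, sketched below. A k×k×k convolution holds k³·C_in·C_out weights regardless of input size, while its compute grows with the number of active voxels. The layer configurations are hypothetical and are not taken from the ESAM or ESAM++ architectures.

```python
def conv3d_params(k, c_in, c_out):
    """Number of weights in a k*k*k 3D convolution, ignoring bias."""
    return k ** 3 * c_in * c_out

def conv3d_macs_upper_bound(n_active, k, c_in, c_out):
    """Upper bound on multiply-accumulates for a sparse convolution:
    each active output voxel has at most k^3 occupied neighbours."""
    return n_active * conv3d_params(k, c_in, c_out)

# Hypothetical layers: (name, active voxels, kernel size, channels in, channels out).
layers = [
    ("shallow", 100_000, 5, 16, 16),
    ("deep", 1_000, 3, 256, 256),
]
for name, n_active, k, c_in, c_out in layers:
    params = conv3d_params(k, c_in, c_out)
    macs = conv3d_macs_upper_bound(n_active, k, c_in, c_out)
    print(f"{name}: params={params}, macs_upper_bound={macs}")
```

With these made-up configurations the shallow layer has far fewer weights than the deep one but a higher compute bound, which mirrors the latency/size split the caption describes.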

Original abstract

Online 3D scene perception in real time is essential for robotics, AR/VR, and autonomous systems, particularly in edge computing scenarios where computational resources are limited and privacy is crucial. Recent state-of-the-art methods like EmbodiedSAM (ESAM) demonstrate the promise of online 3D perception by leveraging the Segment Anything Model (SAM) for real-time, fine-grained, and generalized 3D instance segmentation. However, ESAM still relies on a computationally expensive 3D sparse UNet for point cloud feature extraction, which accounts for the majority of the 3D inference time, hindering its practicality on resource-constrained devices. In this paper, we propose ESAM++, a lightweight and scalable alternative for online 3D scene perception tailored to edge devices without GPU acceleration. Our method introduces a 3D Sparse Feature Pyramid Network (SFPN) that efficiently captures multi-scale geometric features from streaming 3D point clouds while significantly reducing computational overhead and model size. We evaluate our approach on four challenging segmentation benchmarks, namely ScanNet, ScanNet200, SceneNN, and 3RScan, demonstrating that our model achieves competitive accuracy with up to 3 times faster inference with a 2 times smaller model size compared to ESAM, enabling practical deployment on edge devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ESAM++ replaces the heavy 3D UNet in ESAM with a new SFPN to cut inference time and model size for online point-cloud segmentation on edge hardware, but the abstract gives no numbers or ablations to show the accuracy trade-off is actually acceptable.

The main takeaway is a targeted swap: drop the 3D sparse UNet that dominates ESAM runtime and put in a lighter 3D Sparse Feature Pyramid Network (SFPN) that still feeds multi-scale features to the SAM-based segmentation head. The paper claims this yields up to 3x faster inference and 2x smaller models while staying competitive on ScanNet, ScanNet200, SceneNN, and 3RScan.

What is new is the SFPN itself—an extension of feature-pyramid ideas to sparse 3D streaming data in an online setting. The work does a clear job naming the practical bottleneck (UNet cost on edge devices) and showing how the architecture change addresses it without changing the rest of the ESAM pipeline.

The soft spot is the evidence. The abstract states competitive accuracy and concrete speed/size gains, yet supplies none of the actual metrics, error bars, or controlled comparisons. Without those, it is impossible to tell whether the lighter pyramid operations preserve the fine local geometry the original UNet captured, especially under streaming constraints. The central assumption—that SFPN extracts enough multi-scale features to keep downstream instance segmentation quality intact—remains untested in the provided summary.

This paper is aimed at people building real-time 3D perception stacks for robotics or AR/VR on resource-limited hardware. A reader who needs a drop-in lighter backbone for similar SAM-based systems would find the architecture description useful once the numbers appear.

It deserves peer review if the full manuscript includes the missing quantitative results and ablations; the problem is relevant and the change is a reasonable incremental step.

Referee Report

2 major / 1 minor

Summary. The paper proposes ESAM++, a lightweight alternative to EmbodiedSAM (ESAM) for online 3D instance segmentation on edge devices. It introduces a 3D Sparse Feature Pyramid Network (SFPN) to replace the computationally expensive 3D sparse UNet for extracting multi-scale geometric features from streaming point clouds, claiming competitive accuracy on ScanNet, ScanNet200, SceneNN, and 3RScan while delivering up to 3x faster inference and 2x smaller model size.

Significance. If the empirical claims hold with rigorous validation, the work would enable practical real-time 3D perception on resource-constrained hardware without GPUs, with direct relevance to robotics, AR/VR, and autonomous systems. The SFPN design offers a potentially scalable approach to multi-scale feature extraction in streaming settings.

major comments (2)

[Abstract] Abstract: the central claim of 'competitive accuracy' with concrete speed/size gains rests on empirical comparison to ESAM, yet the abstract (and by extension the evaluation) supplies no quantitative metrics, error bars, ablation details, or description of how the comparison to the original 3D sparse UNet was controlled; this absence directly undermines assessment of whether SFPN preserves fine-grained segmentation quality.
[Method (SFPN)] Method section describing SFPN: the assertion that the lighter pyramid operations extract sufficient multi-scale features to support downstream SAM-based segmentation at the same quality as the 3D sparse UNet lacks any direct ablation, parameter comparison, or streaming-specific analysis showing preservation of local geometry; this is load-bearing for the generality claim across the four benchmarks.

minor comments (1)

[Abstract] Abstract: the phrasing 'up to 3 times faster inference with a 2 times smaller model size' is imprecise without reference to specific hardware, batch sizes, or exact measured values.
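One way to make such a speed claim reproducible is to report measured latencies together with the measurement setup. The sketch below is a generic timing harness using only the Python standard library; `benchmark` and `dummy_workload` are hypothetical names, and the workload is a stand-in for a single-frame forward pass, not the ESAM++ model.

```python
import statistics
import time

def benchmark(fn, warmup=3, runs=20):
    """Return min, median, and max wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):  # warm-up calls are excluded from the statistics
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "runs": runs,
        "min_ms": min(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }

def dummy_workload():
    # Stand-in for one forward pass on one frame (assumption for illustration).
    return sum(i * i for i in range(10_000))

stats = benchmark(dummy_workload)
```

Publishing the median alongside the spread, the hardware, and the batch size would let readers judge what "up to 3 times faster" means in practice.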

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the abstract would benefit from explicit quantitative metrics and that additional analysis on SFPN would strengthen the method section. We address each major comment below and will incorporate revisions accordingly.

Referee: [Abstract] Abstract: the central claim of 'competitive accuracy' with concrete speed/size gains rests on empirical comparison to ESAM, yet the abstract (and by extension the evaluation) supplies no quantitative metrics, error bars, ablation details, or description of how the comparison to the original 3D sparse UNet was controlled; this absence directly undermines assessment of whether SFPN preserves fine-grained segmentation quality.

Authors: We agree the abstract would be strengthened by including specific numbers. In the revision we will add the key quantitative results (e.g., mAP/mIoU on each benchmark, inference latency, and model size) directly into the abstract while retaining the high-level claims. Error bars, full ablation tables, and the controlled experimental protocol (identical SAM backbone, same streaming input settings, and evaluation metrics as ESAM) are already reported in Section 4 and the supplementary material; we will add a brief cross-reference in the abstract to these sections. This addresses the concern about assessing fine-grained quality preservation.

revision: yes

Referee: [Method (SFPN)] Method section describing SFPN: the assertion that the lighter pyramid operations extract sufficient multi-scale features to support downstream SAM-based segmentation at the same quality as the 3D sparse UNet lacks any direct ablation, parameter comparison, or streaming-specific analysis showing preservation of local geometry; this is load-bearing for the generality claim across the four benchmarks.

Authors: We will add a dedicated ablation subsection (new Table in Section 3 or 4) that directly compares SFPN against the 3D sparse UNet on parameter count, FLOPs, and downstream instance segmentation metrics (AP, mIoU) across the four benchmarks. We will also include a streaming-specific analysis with visualizations of local geometry preservation (e.g., feature maps at different scales on sequential point clouds) and quantitative metrics such as feature similarity or boundary F-score. These additions will make the generality claim explicit and address the load-bearing nature of the SFPN design.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical replacement of UNet by SFPN validated on external benchmarks

The paper's core contribution is an empirical architecture change (SFPN for multi-scale streaming features) whose performance is measured by direct comparison to the prior ESAM method on ScanNet/ScanNet200/SceneNN/3RScan. No equations, fitted parameters, or derivations are presented that reduce to the inputs by construction. The assumption that SFPN preserves segmentation quality is tested experimentally rather than asserted via self-definition or self-citation chains. Self-citation of ESAM is present but not load-bearing for the new result, as the evaluation uses independent datasets and metrics. This is a standard non-circular empirical improvement paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The contribution centers on a new network design (SFPN) rather than new physical entities. The work inherits standard computer-vision assumptions about the utility of SAM for 3D tasks and the feasibility of sparse multi-scale feature extraction.

axioms (2)

domain assumption: The Segment Anything Model can be leveraged for real-time 3D instance segmentation when paired with an appropriate point-cloud backbone.
The entire ESAM++ effort builds on this premise from the cited ESAM paper.
domain assumption: Streaming 3D point clouds admit efficient multi-scale feature extraction via a sparse feature pyramid without requiring the capacity of a full 3D UNet.
This is the load-bearing modeling choice that justifies replacing the UNet.

invented entities (1)

3D Sparse Feature Pyramid Network (SFPN): no independent evidence
purpose: Lightweight multi-scale geometric feature extractor for streaming point clouds on edge hardware.
New architecture proposed by the authors; no external independent evidence is supplied in the abstract.

Reference graph

Works this paper leans on

44 extracted references · 11 canonical work pages · 4 internal anchors

[1] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
[2] Devendra Singh Chaplot, Dhiraj Prakashchand Gandhi, Abhinav Gupta, and Russ R Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. Advances in Neural Information Processing Systems, 33:4247–4258,
[3] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3075–3084,
[4] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017.
[5] Ruofei Du, Eric Turner, Maksym Dzitsiuk, Luca Prasso, Ivo Duarte, Jason Dourgarian, Joao Afonso, Jose Pascoal, Josh Gladstone, Nuno Cruces, et al. Depthlab: Real-time 3d interaction with depth maps for mobile augmented reality. In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, pages 829–843, 2020.
[6] Alessandro Giusti, Jérôme Guzzi, Dan C Cireşan, Fang-Lin He, Juan P Rodríguez, Flavio Fontana, Matthias Faessler, Christian Forster, Jürgen Schmidhuber, Gianni Di Caro, et al. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 1(2):661–667, 2015.
[7] Benjamin Graham, Martin Engelcke, and Laurens Van Der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9224–9232, 2018.
[8] Binh-Son Hua, Quang-Hieu Pham, Duc Thanh Nguyen, Minh-Khoi Tran, Lap-Fai Yu, and Sai-Kit Yeung. Scenenn: A scene meshes dataset with annotations. In 2016 fourth international conference on 3D vision (3DV), pages 92–101. IEEE, 2016.
[9] Shi-Sheng Huang, Ze-Yu Ma, Tai-Jiang Mu, Hongbo Fu, and Shi-Min Hu. Supervoxel convolution for online 3d semantic segmentation. ACM Transactions on Graphics (TOG), 40(3):1–15, 2021.
[10] Alex Kendall, Jeffrey Hawke, David Janz, Przemyslaw Mazur, Daniele Reda, John-Mark Allen, Vinh-Dieu Lam, Alex Bewley, and Amar Shah. Learning to drive in a day. In 2019 international conference on robotics and automation (ICRA), pages 8248–8254. IEEE, 2019.
[11] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023.
[12] Maxim Kolodiazhnyi, Anna Vorontsova, Anton Konushin, and Danila Rukhovich. Oneformer3d: One transformer for unified point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20943–20953, 2024.
[13] Maksim Kolodiazhnyi, Anna Vorontsova, Anton Konushin, and Danila Rukhovich. Top-down beats bottom-up in 3d instance segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3566–3574, 2024.
[14] Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianwei Yang, Lei Zhang, and Jianfeng Gao. Segment and recognize anything at any granularity. In European Conference on Computer Vision, pages 467–484. Springer, 2024.
[15] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
[16] Leyao Liu, Tian Zheng, Yun-Jou Lin, Kai Ni, and Lu Fang. Ins-conv: Incremental sparse convolution for online 3d segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18975–18984, 2022.
[17] John McCormac, Ankur Handa, Andrew Davison, and Stefan Leutenegger. Semanticfusion: Dense 3d semantic mapping with convolutional neural networks. In 2017 IEEE International Conference on Robotics and automation (ICRA), pages 4628–4635. IEEE, 2017.
[18] Arsalan Mousavian, Clemens Eppner, and Dieter Fox. 6-dof graspnet: Variational grasp generation for object manipulation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2901–2910, 2019.
[19] Gaku Narita, Takashi Seno, Tomoya Ishikawa, and Yohsuke Kaji. Panopticfusion: Online volumetric semantic mapping at the level of stuff and things. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4205–4212. IEEE, 2019.
[20] Phuc Nguyen, Tuan Duc Ngo, Evangelos Kalogerakis, Chuang Gan, Anh Tran, Cuong Pham, and Khoi Nguyen. Open3dis: Open-vocabulary 3d instance segmentation with 2d mask guidance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4018–4028, 2024.
[21] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[22] A Paszke. PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703,
[23] Wayne Piekarski and Bruce Thomas. Arquake: the outdoor augmented reality gaming system. Communications of the ACM, 45(1):36–38, 2002.
[24] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660,
[25] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30, 2017.
[26] Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20051–20060, 2024.
[27] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded SAM: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159,
[28] David Rozenberszki, Or Litany, and Angela Dai. Language-grounded indoor 3d semantic segmentation in the wild. In European Conference on Computer Vision, pages 125–141. Springer, 2022.
[29] David Rozenberszki, Or Litany, and Angela Dai. Unscene3d: Unsupervised 3d instance segmentation for indoor scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19957–19967, 2024.
[30] Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J Guibas. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6411–6420, 2019.
[31] Johanna Wald, Armen Avetisyan, Nassir Navab, Federico Tombari, and Matthias Nießner. Rio: 3d object instance re-localization in changing indoor environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7658–7667, 2019.
[32] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. arXiv preprint arXiv:2501.12387, 2025.
[33] Wenxuan Wu, Zhiyuan Wang, Zhuwen Li, Wei Liu, and Li Fuxin. Pointpwc-net: A coarse-to-fine network for supervised and self-supervised scene flow estimation on 3d point clouds. arXiv preprint arXiv:1911.12408, 2019.
[34] Xiaoyang Wu, Daniel DeTone, Duncan Frost, Tianwei Shen, Chris Xie, Nan Yang, Jakob Engel, Richard Newcombe, Hengshuang Zhao, and Julian Straub. Sonata: Self-supervised learning of reliable point representations. arXiv preprint arXiv:2503.16429, 2025.
[35] Mutian Xu, Xingyilang Yin, Lingteng Qiu, Yang Liu, Xin Tong, and Xiaoguang Han. Sampro3d: Locating sam prompts in 3d for zero-shot scene segmentation. arXiv preprint arXiv:2311.17707, 2023.
[36] Xiuwei Xu, Huangxing Chen, Linqing Zhao, Ziwei Wang, Jie Zhou, and Jiwen Lu. Embodiedsam: Online segment any 3d thing in real time. arXiv preprint arXiv:2408.11811,
[37] Xiuwei Xu, Chong Xia, Ziwei Wang, Linqing Zhao, Yueqi Duan, Jie Zhou, and Jiwen Lu. Memory-based adapters for online 3d scene perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21604–21613, 2024.
[38] Yunhan Yang, Xiaoyang Wu, Tong He, Hengshuang Zhao, and Xihui Liu. SAM3D: Segment anything in 3d scenes. arXiv preprint arXiv:2306.03908, 2023.
[39] Mingqiao Ye, Martin Danelljan, Fisher Yu, and Lei Ke. Gaussian grouping: Segment and edit anything in 3d scenes. In European Conference on Computer Vision, pages 162–
[40] Yingda Yin, Yuzheng Liu, Yang Xiao, Daniel Cohen-Or, Jingwei Huang, and Baoquan Chen. Sai3d: Segment any instance in 3d scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3292–3302, 2024.
[41] Jiazhao Zhang, Chenyang Zhu, Lintao Zheng, and Kai Xu. Fusion-aware point convolution for online semantic 3d scene segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4534–4543, 2020.
[42] Jiazhao Zhang, Liu Dai, Fanpeng Meng, Qingnan Fan, Xuelin Chen, Kai Xu, and He Wang. 3d-aware object goal navigation via simultaneous exploration and identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6672–6682, 2023.
[43] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 16259–16268, 2021.
[44] Xu Zhao, Wenchao Ding, Yongqi An, Yinglong Du, Tao Yu, Min Li, Ming Tang, and Jinqiao Wang. Fast segment anything. arXiv preprint arXiv:2306.12156, 2023.


2019

[11] [11]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 4015–4026, 2023. 1, 2, 3, 6

2023

[12] [12]

Oneformer3d: One transformer for unified point cloud segmentation

Maxim Kolodiazhnyi, Anna V orontsova, Anton Konushin, and Danila Rukhovich. Oneformer3d: One transformer for unified point cloud segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20943–20953, 2024. 6, 7

2024

[13] [13]

Top-down beats bottom-up in 3d in- stance segmentation

Maksim Kolodiazhnyi, Anna V orontsova, Anton Konushin, and Danila Rukhovich. Top-down beats bottom-up in 3d in- stance segmentation. InProceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision, pages 3566–3574, 2024. 6, 7

2024

[14] [14]

Segment and recognize anything at any granularity

Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianwei Yang, Lei Zhang, and Jianfeng Gao. Segment and recognize anything at any granularity. InEu- ropean Conference on Computer Vision, pages 467–484. Springer, 2024. 2, 3, 6

2024

[15] [15]

Feature pyra- mid networks for object detection

Tsung-Yi Lin, Piotr Doll ´ar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyra- mid networks for object detection. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 2117–2125, 2017. 2

2017

[16] [16]

Ins-conv: Incremental sparse convolution for online 3d seg- mentation

Leyao Liu, Tian Zheng, Yun-Jou Lin, Kai Ni, and Lu Fang. Ins-conv: Incremental sparse convolution for online 3d seg- mentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18975– 18984, 2022. 1, 2, 6, 7

2022

[17] [17]

Semanticfusion: Dense 3d semantic map- ping with convolutional neural networks

John McCormac, Ankur Handa, Andrew Davison, and Ste- fan Leutenegger. Semanticfusion: Dense 3d semantic map- ping with convolutional neural networks. In2017 IEEE In- ternational Conference on Robotics and automation (ICRA), pages 4628–4635. IEEE, 2017. 2

2017

[18] [18]

6-dof graspnet: Variational grasp generation for object manipula- tion

Arsalan Mousavian, Clemens Eppner, and Dieter Fox. 6-dof graspnet: Variational grasp generation for object manipula- tion. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 2901–2910, 2019. 1, 2

2019

[19] [19]

Panopticfusion: Online volumetric semantic mapping at the level of stuff and things

Gaku Narita, Takashi Seno, Tomoya Ishikawa, and Yohsuke Kaji. Panopticfusion: Online volumetric semantic mapping at the level of stuff and things. In2019 IEEE/RSJ Interna- tional Conference on Intelligent Robots and Systems (IROS), pages 4205–4212. IEEE, 2019. 2

2019

[20] [20]

Open3dis: Open-vocabulary 3d instance segmentation with 2d mask guidance

Phuc Nguyen, Tuan Duc Ngo, Evangelos Kalogerakis, Chuang Gan, Anh Tran, Cuong Pham, and Khoi Nguyen. Open3dis: Open-vocabulary 3d instance segmentation with 2d mask guidance. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 4018–4028, 2024. 6

2024

[21] [21]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2023

[22] [22]

PyTorch: An Imperative Style, High-Performance Deep Learning Library

A Paszke. Pytorch: An imperative style, high-performance deep learning library.arXiv preprint arXiv:1912.01703,

work page internal anchor Pith review Pith/arXiv arXiv 1912

[23] [23]

Arquake: the outdoor augmented reality gaming system.Communications of the ACM, 45(1):36–38, 2002

Wayne Piekarski and Bruce Thomas. Arquake: the outdoor augmented reality gaming system.Communications of the ACM, 45(1):36–38, 2002. 1

2002

[24] [24]

Pointnet: Deep learning on point sets for 3d classification and segmentation

Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660,

[25] [25]

Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017

Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017. 2

2017

[26] [26]

Langsplat: 3d language gaussian splatting

Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20051–20060, 2024. 3

2024

[27] [27]

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks.arXiv preprint arXiv:2401.14159,

work page internal anchor Pith review Pith/arXiv arXiv

[28] [28]

Language- grounded indoor 3d semantic segmentation in the wild

David Rozenberszki, Or Litany, and Angela Dai. Language- grounded indoor 3d semantic segmentation in the wild. In European Conference on Computer Vision, pages 125–141. Springer, 2022. 5

2022

[29] [29]

Unscene3d: Unsupervised 3d instance segmentation for indoor scenes

David Rozenberszki, Or Litany, and Angela Dai. Unscene3d: Unsupervised 3d instance segmentation for indoor scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19957–19967, 2024. 3

2024

[30] [30]

Kpconv: Flexible and deformable convolution for point clouds

Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, Franc ¸ois Goulette, and Leonidas J Guibas. Kpconv: Flexible and deformable convolution for point clouds. InProceedings of the IEEE/CVF international conference on computer vision, pages 6411–6420, 2019. 2

2019

[31] [31]

Rio: 3d object instance re- localization in changing indoor environments

Johanna Wald, Armen Avetisyan, Nassir Navab, Federico Tombari, and Matthias Nießner. Rio: 3d object instance re- localization in changing indoor environments. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 7658–7667, 2019. 5

2019

[32] [32]

Continuous 3d perception model with persistent state.arXiv preprint arXiv:2501.12387, 2025

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state.arXiv preprint arXiv:2501.12387, 2025. 2, 8

work page arXiv 2025

[33] [33]

Pointpwc-net: A coarse-to-fine network for super- vised and self-supervised scene flow estimation on 3d point clouds.arXiv preprint arXiv:1911.12408, 2019

Wenxuan Wu, Zhiyuan Wang, Zhuwen Li, Wei Liu, and Li Fuxin. Pointpwc-net: A coarse-to-fine network for super- vised and self-supervised scene flow estimation on 3d point clouds.arXiv preprint arXiv:1911.12408, 2019. 2

work page arXiv 1911

[34] [34]

Sonata: Self- supervised learning of reliable point representations.arXiv preprint arXiv:2503.16429, 2025

Xiaoyang Wu, Daniel DeTone, Duncan Frost, Tianwei Shen, Chris Xie, Nan Yang, Jakob Engel, Richard New- combe, Hengshuang Zhao, and Julian Straub. Sonata: Self- supervised learning of reliable point representations.arXiv preprint arXiv:2503.16429, 2025. 2

work page arXiv 2025

[35] [35]

Sampro3d: Locating sam prompts in 3d for zero-shot scene segmentation.arXiv preprint arXiv:2311.17707,

Mutian Xu, Xingyilang Yin, Lingteng Qiu, Yang Liu, Xin Tong, and Xiaoguang Han. Sampro3d: Locating sam prompts in 3d for zero-shot scene segmentation.arXiv preprint arXiv:2311.17707, 2023. 3, 6

work page arXiv 2023

[36] [36]

Embodiedsam: Online segment any 3d thing in real time.arXiv preprint arXiv:2408.11811,

Xiuwei Xu, Huangxing Chen, Linqing Zhao, Ziwei Wang, Jie Zhou, and Jiwen Lu. Embodiedsam: Online segment any 3d thing in real time.arXiv preprint arXiv:2408.11811,

work page arXiv

[37] [37]

Memory-based adapters for online 3d scene perception

Xiuwei Xu, Chong Xia, Ziwei Wang, Linqing Zhao, Yueqi Duan, Jie Zhou, and Jiwen Lu. Memory-based adapters for online 3d scene perception. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21604–21613, 2024. 1, 2, 4, 5, 6, 7

2024

[38] [38]

SAM3D: Segment anything in 3D scenes.arXiv preprint arXiv:2306.03908, 2023

Yunhan Yang, Xiaoyang Wu, Tong He, Hengshuang Zhao, and Xihui Liu. Sam3d: Segment anything in 3d scenes.arXiv preprint arXiv:2306.03908, 2023. 1, 3, 6

work page arXiv 2023

[39] [39]

Gaussian grouping: Segment and edit anything in 3d scenes

Mingqiao Ye, Martin Danelljan, Fisher Yu, and Lei Ke. Gaussian grouping: Segment and edit anything in 3d scenes. InEuropean Conference on Computer Vision, pages 162–

[40] [40]

Sai3d: Segment any instance in 3d scenes

Yingda Yin, Yuzheng Liu, Yang Xiao, Daniel Cohen-Or, Jingwei Huang, and Baoquan Chen. Sai3d: Segment any instance in 3d scenes. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 3292–3302, 2024. 3, 6

2024

[41] [41]

Fusion-aware point convolution for online semantic 3d scene segmentation

Jiazhao Zhang, Chenyang Zhu, Lintao Zheng, and Kai Xu. Fusion-aware point convolution for online semantic 3d scene segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4534– 4543, 2020. 2

2020

[42] [42]

3d-aware object goal navigation via simultaneous exploration and identification

Jiazhao Zhang, Liu Dai, Fanpeng Meng, Qingnan Fan, Xuelin Chen, Kai Xu, and He Wang. 3d-aware object goal navigation via simultaneous exploration and identification. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6672–6682, 2023. 1, 2

2023

[43] [43]

Point transformer

Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 16259–16268, 2021. 2

2021

[44] [44]

Fast segment anything.arXiv preprint arXiv:2306.12156, 2023

Xu Zhao, Wenchao Ding, Yongqi An, Yinglong Du, Tao Yu, Min Li, Ming Tang, and Jinqiao Wang. Fast segment any- thing.arXiv preprint arXiv:2306.12156, 2023. 1, 3, 6

work page arXiv 2023