ESAM++: Efficient Online 3D Perception on the Edge
Pith reviewed 2026-06-29 08:47 UTC · model grok-4.3
The pith
ESAM++ replaces the 3D sparse UNet with a lighter Sparse Feature Pyramid Network to deliver competitive online 3D segmentation up to three times faster and with half the model size on edge devices.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ESAM++ introduces a 3D Sparse Feature Pyramid Network (SFPN) that efficiently captures multi-scale geometric features from streaming 3D point clouds, replacing the computationally dominant 3D sparse UNet of the original ESAM and thereby achieving competitive accuracy on four segmentation benchmarks with up to three times faster inference and a two times smaller model size.
What carries the argument
The 3D Sparse Feature Pyramid Network (SFPN), which extracts multi-scale geometric features from point clouds with reduced overhead in place of a full 3D sparse UNet.
If this is right
- ESAM++ runs online 3D instance segmentation at interactive rates on CPUs or edge chips without GPUs.
- The model size drops by a factor of two while accuracy stays competitive across four standard benchmarks.
- Streaming point-cloud perception becomes feasible in privacy-sensitive or power-limited settings such as mobile robots.
- The same SFPN block can be swapped into other pipelines that currently rely on 3D sparse UNets for feature extraction.
Where Pith is reading between the lines
- Similar pyramid-style replacements might reduce compute in other sparse-data tasks such as 3D object detection or semantic mapping.
- If the accuracy holds on new datasets, the approach could extend real-time 3D perception to consumer phones or embedded cameras.
- The speed gain opens the door to higher frame-rate or higher-resolution inputs that were previously too expensive.
Load-bearing premise
The Sparse Feature Pyramid Network can pull out enough multi-scale detail from streaming point clouds to keep the same fine-grained segmentation quality that the original 3D sparse UNet provided.
What would settle it
Measure instance segmentation mIoU or AP on ScanNet using the released ESAM++ weights and check whether accuracy falls more than a few points below the original ESAM while the reported speed and size gains remain.
Figures
read the original abstract
Online 3D scene perception in real time is essential for robotics, AR/VR, and autonomous systems, particularly in edge computing scenarios where computational resources are limited and privacy is crucial. Recent state-of-the-art methods like EmbodiedSAM (ESAM) demonstrate the promise of online 3D perception by leveraging the Segment Anything Model (SAM) for real-time, fine-grained, and generalized 3D instance segmentation. However, ESAM still relies on a computationally expensive 3D sparse UNet for point cloud feature extraction, which accounts for the majority of the 3D inference time, hindering its practicality on resource-constrained devices. In this paper, we propose ESAM++, a lightweight and scalable alternative for online 3D scene perception tailored to edge devices without GPU acceleration. Our method introduces a 3D Sparse Feature Pyramid Network (SFPN) that efficiently captures multi-scale geometric features from streaming 3D point clouds while significantly reducing computational overhead and model size. We evaluate our approach on four challenging segmentation benchmarks, namely ScanNet, ScanNet200, SceneNN, and 3RScan, demonstrating that our model achieves competitive accuracy with up to 3 times faster inference with a 2 times smaller model size compared to ESAM, enabling practical deployment on edge devices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ESAM++, a lightweight alternative to EmbodiedSAM (ESAM) for online 3D instance segmentation on edge devices. It introduces a 3D Sparse Feature Pyramid Network (SFPN) to replace the computationally expensive 3D sparse UNet for extracting multi-scale geometric features from streaming point clouds, claiming competitive accuracy on ScanNet, ScanNet200, SceneNN, and 3RScan while delivering up to 3x faster inference and 2x smaller model size.
Significance. If the empirical claims hold with rigorous validation, the work would enable practical real-time 3D perception on resource-constrained hardware without GPUs, with direct relevance to robotics, AR/VR, and autonomous systems. The SFPN design offers a potentially scalable approach to multi-scale feature extraction in streaming settings.
major comments (2)
- [Abstract] Abstract: the central claim of 'competitive accuracy' with concrete speed/size gains rests on empirical comparison to ESAM, yet the abstract (and by extension the evaluation) supplies no quantitative metrics, error bars, ablation details, or description of how the comparison to the original 3D sparse UNet was controlled; this absence directly undermines assessment of whether SFPN preserves fine-grained segmentation quality.
- [Method (SFPN)] Method section describing SFPN: the assertion that the lighter pyramid operations extract sufficient multi-scale features to support downstream SAM-based segmentation at the same quality as the 3D sparse UNet lacks any direct ablation, parameter comparison, or streaming-specific analysis showing preservation of local geometry; this is load-bearing for the generality claim across the four benchmarks.
minor comments (1)
- [Abstract] Abstract: the phrasing 'up to 3 times faster inference with a 2 times smaller model size' is imprecise without reference to specific hardware, batch sizes, or exact measured values.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the abstract would benefit from explicit quantitative metrics and that additional analysis on SFPN would strengthen the method section. We address each major comment below and will incorporate revisions accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim of 'competitive accuracy' with concrete speed/size gains rests on empirical comparison to ESAM, yet the abstract (and by extension the evaluation) supplies no quantitative metrics, error bars, ablation details, or description of how the comparison to the original 3D sparse UNet was controlled; this absence directly undermines assessment of whether SFPN preserves fine-grained segmentation quality.
Authors: We agree the abstract would be strengthened by including specific numbers. In the revision we will add the key quantitative results (e.g., mAP/mIoU on each benchmark, inference latency, and model size) directly into the abstract while retaining the high-level claims. Error bars, full ablation tables, and the controlled experimental protocol (identical SAM backbone, same streaming input settings, and evaluation metrics as ESAM) are already reported in Section 4 and the supplementary material; we will add a brief cross-reference in the abstract to these sections. This addresses the concern about assessing fine-grained quality preservation. revision: yes
-
Referee: [Method (SFPN)] Method section describing SFPN: the assertion that the lighter pyramid operations extract sufficient multi-scale features to support downstream SAM-based segmentation at the same quality as the 3D sparse UNet lacks any direct ablation, parameter comparison, or streaming-specific analysis showing preservation of local geometry; this is load-bearing for the generality claim across the four benchmarks.
Authors: We will add a dedicated ablation subsection (new Table in Section 3 or 4) that directly compares SFPN against the 3D sparse UNet on parameter count, FLOPs, and downstream instance segmentation metrics (AP, mIoU) across the four benchmarks. We will also include a streaming-specific analysis with visualizations of local geometry preservation (e.g., feature maps at different scales on sequential point clouds) and quantitative metrics such as feature similarity or boundary F-score. These additions will make the generality claim explicit and address the load-bearing nature of the SFPN design. revision: yes
Circularity Check
No circularity: empirical replacement of UNet by SFPN validated on external benchmarks
full rationale
The paper's core contribution is an empirical architecture change (SFPN for multi-scale streaming features) whose performance is measured by direct comparison to the prior ESAM method on ScanNet/ScanNet200/SceneNN/3RScan. No equations, fitted parameters, or derivations are presented that reduce to the inputs by construction. The assumption that SFPN preserves segmentation quality is tested experimentally rather than asserted via self-definition or self-citation chains. Self-citation of ESAM is present but not load-bearing for the new result, as the evaluation uses independent datasets and metrics. This is a standard non-circular empirical improvement paper.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The Segment Anything Model can be leveraged for real-time 3D instance segmentation when paired with an appropriate point-cloud backbone.
- domain assumption Streaming 3D point clouds admit efficient multi-scale feature extraction via a sparse feature pyramid without requiring the capacity of a full 3D UNet.
invented entities (1)
-
3D Sparse Feature Pyramid Network (SFPN)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
End to End Learning for Self-Driving Cars
Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars.arXiv preprint arXiv:1604.07316, 2016. 1
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[2]
Object goal navi- gation using goal-oriented semantic exploration.Advances in Neural Information Processing Systems, 33:4247–4258,
Devendra Singh Chaplot, Dhiraj Prakashchand Gandhi, Ab- hinav Gupta, and Russ R Salakhutdinov. Object goal navi- gation using goal-oriented semantic exploration.Advances in Neural Information Processing Systems, 33:4247–4258,
-
[3]
4d spatio-temporal convnets: Minkowski convolutional neural networks
Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3075–3084,
-
[4]
Scannet: Richly-annotated 3d reconstructions of indoor scenes
Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 5
2017
-
[5]
Depthlab: Real-time 3d in- teraction with depth maps for mobile augmented reality
Ruofei Du, Eric Turner, Maksym Dzitsiuk, Luca Prasso, Ivo Duarte, Jason Dourgarian, Joao Afonso, Jose Pascoal, Josh Gladstone, Nuno Cruces, et al. Depthlab: Real-time 3d in- teraction with depth maps for mobile augmented reality. In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, pages 829–843, 2020. 1
2020
-
[6]
A machine learning approach to visual perception of forest trails for mobile robots.IEEE Robotics and Automation Let- ters, 1(2):661–667, 2015
Alessandro Giusti, J ´erˆome Guzzi, Dan C Cires ¸an, Fang-Lin He, Juan P Rodr ´ıguez, Flavio Fontana, Matthias Faessler, Christian Forster, J¨urgen Schmidhuber, Gianni Di Caro, et al. A machine learning approach to visual perception of forest trails for mobile robots.IEEE Robotics and Automation Let- ters, 1(2):661–667, 2015. 1
2015
-
[7]
3d semantic segmentation with submani- fold sparse convolutional networks
Benjamin Graham, Martin Engelcke, and Laurens Van Der Maaten. 3d semantic segmentation with submani- fold sparse convolutional networks. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 9224–9232, 2018. 2
2018
-
[8]
Scenenn: A scene meshes dataset with annotations
Binh-Son Hua, Quang-Hieu Pham, Duc Thanh Nguyen, Minh-Khoi Tran, Lap-Fai Yu, and Sai-Kit Yeung. Scenenn: A scene meshes dataset with annotations. In2016 fourth in- ternational conference on 3D vision (3DV), pages 92–101. Ieee, 2016. 5
2016
-
[9]
Supervoxel convolution for online 3d semantic segmentation.ACM Transactions on Graphics (TOG), 40(3): 1–15, 2021
Shi-Sheng Huang, Ze-Yu Ma, Tai-Jiang Mu, Hongbo Fu, and Shi-Min Hu. Supervoxel convolution for online 3d semantic segmentation.ACM Transactions on Graphics (TOG), 40(3): 1–15, 2021. 2
2021
-
[10]
Learning to drive in a day
Alex Kendall, Jeffrey Hawke, David Janz, Przemyslaw Mazur, Daniele Reda, John-Mark Allen, Vinh-Dieu Lam, Alex Bewley, and Amar Shah. Learning to drive in a day. In2019 international conference on robotics and automation (ICRA), pages 8248–8254. IEEE, 2019. 1
2019
-
[11]
Segment any- thing
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 4015–4026, 2023. 1, 2, 3, 6
2023
-
[12]
Oneformer3d: One transformer for unified point cloud segmentation
Maxim Kolodiazhnyi, Anna V orontsova, Anton Konushin, and Danila Rukhovich. Oneformer3d: One transformer for unified point cloud segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20943–20953, 2024. 6, 7
2024
-
[13]
Top-down beats bottom-up in 3d in- stance segmentation
Maksim Kolodiazhnyi, Anna V orontsova, Anton Konushin, and Danila Rukhovich. Top-down beats bottom-up in 3d in- stance segmentation. InProceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision, pages 3566–3574, 2024. 6, 7
2024
-
[14]
Segment and recognize anything at any granularity
Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianwei Yang, Lei Zhang, and Jianfeng Gao. Segment and recognize anything at any granularity. InEu- ropean Conference on Computer Vision, pages 467–484. Springer, 2024. 2, 3, 6
2024
-
[15]
Feature pyra- mid networks for object detection
Tsung-Yi Lin, Piotr Doll ´ar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyra- mid networks for object detection. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 2117–2125, 2017. 2
2017
-
[16]
Ins-conv: Incremental sparse convolution for online 3d seg- mentation
Leyao Liu, Tian Zheng, Yun-Jou Lin, Kai Ni, and Lu Fang. Ins-conv: Incremental sparse convolution for online 3d seg- mentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18975– 18984, 2022. 1, 2, 6, 7
2022
-
[17]
Semanticfusion: Dense 3d semantic map- ping with convolutional neural networks
John McCormac, Ankur Handa, Andrew Davison, and Ste- fan Leutenegger. Semanticfusion: Dense 3d semantic map- ping with convolutional neural networks. In2017 IEEE In- ternational Conference on Robotics and automation (ICRA), pages 4628–4635. IEEE, 2017. 2
2017
-
[18]
6-dof graspnet: Variational grasp generation for object manipula- tion
Arsalan Mousavian, Clemens Eppner, and Dieter Fox. 6-dof graspnet: Variational grasp generation for object manipula- tion. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 2901–2910, 2019. 1, 2
2019
-
[19]
Panopticfusion: Online volumetric semantic mapping at the level of stuff and things
Gaku Narita, Takashi Seno, Tomoya Ishikawa, and Yohsuke Kaji. Panopticfusion: Online volumetric semantic mapping at the level of stuff and things. In2019 IEEE/RSJ Interna- tional Conference on Intelligent Robots and Systems (IROS), pages 4205–4212. IEEE, 2019. 2
2019
-
[20]
Open3dis: Open-vocabulary 3d instance segmentation with 2d mask guidance
Phuc Nguyen, Tuan Duc Ngo, Evangelos Kalogerakis, Chuang Gan, Anh Tran, Cuong Pham, and Khoi Nguyen. Open3dis: Open-vocabulary 3d instance segmentation with 2d mask guidance. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 4018–4028, 2024. 6
2024
-
[21]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
PyTorch: An Imperative Style, High-Performance Deep Learning Library
A Paszke. Pytorch: An imperative style, high-performance deep learning library.arXiv preprint arXiv:1912.01703,
work page internal anchor Pith review Pith/arXiv arXiv 1912
-
[23]
Arquake: the outdoor augmented reality gaming system.Communications of the ACM, 45(1):36–38, 2002
Wayne Piekarski and Bruce Thomas. Arquake: the outdoor augmented reality gaming system.Communications of the ACM, 45(1):36–38, 2002. 1
2002
-
[24]
Pointnet: Deep learning on point sets for 3d classification and segmentation
Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660,
-
[25]
Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017
Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017. 2
2017
-
[26]
Langsplat: 3d language gaussian splatting
Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20051–20060, 2024. 3
2024
-
[27]
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks.arXiv preprint arXiv:2401.14159,
work page internal anchor Pith review Pith/arXiv arXiv
-
[28]
Language- grounded indoor 3d semantic segmentation in the wild
David Rozenberszki, Or Litany, and Angela Dai. Language- grounded indoor 3d semantic segmentation in the wild. In European Conference on Computer Vision, pages 125–141. Springer, 2022. 5
2022
-
[29]
Unscene3d: Unsupervised 3d instance segmentation for indoor scenes
David Rozenberszki, Or Litany, and Angela Dai. Unscene3d: Unsupervised 3d instance segmentation for indoor scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19957–19967, 2024. 3
2024
-
[30]
Kpconv: Flexible and deformable convolution for point clouds
Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, Franc ¸ois Goulette, and Leonidas J Guibas. Kpconv: Flexible and deformable convolution for point clouds. InProceedings of the IEEE/CVF international conference on computer vision, pages 6411–6420, 2019. 2
2019
-
[31]
Rio: 3d object instance re- localization in changing indoor environments
Johanna Wald, Armen Avetisyan, Nassir Navab, Federico Tombari, and Matthias Nießner. Rio: 3d object instance re- localization in changing indoor environments. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 7658–7667, 2019. 5
2019
-
[32]
Continuous 3d perception model with persistent state.arXiv preprint arXiv:2501.12387, 2025
Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state.arXiv preprint arXiv:2501.12387, 2025. 2, 8
-
[33]
Wenxuan Wu, Zhiyuan Wang, Zhuwen Li, Wei Liu, and Li Fuxin. Pointpwc-net: A coarse-to-fine network for super- vised and self-supervised scene flow estimation on 3d point clouds.arXiv preprint arXiv:1911.12408, 2019. 2
-
[34]
Xiaoyang Wu, Daniel DeTone, Duncan Frost, Tianwei Shen, Chris Xie, Nan Yang, Jakob Engel, Richard New- combe, Hengshuang Zhao, and Julian Straub. Sonata: Self- supervised learning of reliable point representations.arXiv preprint arXiv:2503.16429, 2025. 2
-
[35]
Mutian Xu, Xingyilang Yin, Lingteng Qiu, Yang Liu, Xin Tong, and Xiaoguang Han. Sampro3d: Locating sam prompts in 3d for zero-shot scene segmentation.arXiv preprint arXiv:2311.17707, 2023. 3, 6
-
[36]
Embodiedsam: Online segment any 3d thing in real time.arXiv preprint arXiv:2408.11811,
Xiuwei Xu, Huangxing Chen, Linqing Zhao, Ziwei Wang, Jie Zhou, and Jiwen Lu. Embodiedsam: Online segment any 3d thing in real time.arXiv preprint arXiv:2408.11811,
-
[37]
Memory-based adapters for online 3d scene perception
Xiuwei Xu, Chong Xia, Ziwei Wang, Linqing Zhao, Yueqi Duan, Jie Zhou, and Jiwen Lu. Memory-based adapters for online 3d scene perception. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21604–21613, 2024. 1, 2, 4, 5, 6, 7
2024
-
[38]
SAM3D: Segment anything in 3D scenes.arXiv preprint arXiv:2306.03908, 2023
Yunhan Yang, Xiaoyang Wu, Tong He, Hengshuang Zhao, and Xihui Liu. Sam3d: Segment anything in 3d scenes.arXiv preprint arXiv:2306.03908, 2023. 1, 3, 6
-
[39]
Gaussian grouping: Segment and edit anything in 3d scenes
Mingqiao Ye, Martin Danelljan, Fisher Yu, and Lei Ke. Gaussian grouping: Segment and edit anything in 3d scenes. InEuropean Conference on Computer Vision, pages 162–
-
[40]
Sai3d: Segment any instance in 3d scenes
Yingda Yin, Yuzheng Liu, Yang Xiao, Daniel Cohen-Or, Jingwei Huang, and Baoquan Chen. Sai3d: Segment any instance in 3d scenes. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 3292–3302, 2024. 3, 6
2024
-
[41]
Fusion-aware point convolution for online semantic 3d scene segmentation
Jiazhao Zhang, Chenyang Zhu, Lintao Zheng, and Kai Xu. Fusion-aware point convolution for online semantic 3d scene segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4534– 4543, 2020. 2
2020
-
[42]
3d-aware object goal navigation via simultaneous exploration and identification
Jiazhao Zhang, Liu Dai, Fanpeng Meng, Qingnan Fan, Xuelin Chen, Kai Xu, and He Wang. 3d-aware object goal navigation via simultaneous exploration and identification. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6672–6682, 2023. 1, 2
2023
-
[43]
Point transformer
Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 16259–16268, 2021. 2
2021
-
[44]
Fast segment anything.arXiv preprint arXiv:2306.12156, 2023
Xu Zhao, Wenchao Ding, Yongqi An, Yinglong Du, Tao Yu, Min Li, Ming Tang, and Jinqiao Wang. Fast segment any- thing.arXiv preprint arXiv:2306.12156, 2023. 1, 3, 6
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.