Incremental Class Discovery for Semantic Segmentation with RGBD Sensing
Pith reviewed 2026-05-24 17:21 UTC · model grok-4.3
The pith
Aggregating RGBD frames into a dense 3D map discovers new semantic classes by clustering coherent unlabeled regions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The system first segments each RGBD frame using both color and geometry, then aggregates the results into a single segmented dense 3D map. Coherent regions of that map that lack a semantic label are treated as new object classes. These regions are also the basic element for clustering both known and unseen objects, which keeps memory and runtime low enough for incremental, semi-real-time operation.
What carries the argument
Coherent regions in the aggregated dense 3D map, which act as the primitive element for identifying and clustering new semantic classes instead of surfels or voxels.
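The discovery step described above can be illustrated with a small sketch. It is not the paper's implementation: the region ids, the majority-vote fusion rule, and the `min_share` threshold are all assumptions made for illustration. Each map region collects one label vote per frame in which it was observed, and a region whose votes do not settle on a known class becomes a candidate for a new class.

```python
from collections import Counter, defaultdict

UNLABELED = None  # hypothetical marker for "no known class"


def fuse_region_labels(observations, min_share=0.6):
    """Fuse per-frame label votes into one label per map region.

    observations: iterable of (region_id, label) pairs, one per frame in
    which the region was seen; label is None when the per-frame segmenter
    assigned no known class. A region keeps a known label only if that
    label wins at least min_share of its votes (an assumed rule).
    """
    votes = defaultdict(Counter)
    for region_id, label in observations:
        votes[region_id][label] += 1

    fused = {}
    for region_id, counter in votes.items():
        label, count = counter.most_common(1)[0]
        share = count / sum(counter.values())
        known = label is not None and share >= min_share
        fused[region_id] = label if known else UNLABELED
    return fused


def discovery_candidates(fused):
    """Regions with no fused semantic label, i.e. candidate new classes."""
    return sorted(r for r, label in fused.items() if label is UNLABELED)


observations = [
    (1, "chair"), (1, "chair"), (1, None),
    (2, None), (2, None), (2, "table"),
    (3, "wall"),
]
fused = fuse_region_labels(observations)
print(discovery_candidates(fused))  # [2]
```

Working per region rather than per surfel or voxel means the vote table has one entry per region, which is the source of the efficiency argument the paper makes.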
If this is right
- The 3D map representation reduces both computational complexity and memory use relative to surfel- or voxel-based alternatives.
- The system runs at 10.7 Hz while incrementally updating the dense 3D map at every frame.
- Experiments on NYUDv2 demonstrate correct clustering of objects from both known and unseen classes.
- Quantitative comparisons with state-of-the-art supervised methods are reported alongside timing and component analyses.
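The 10.7 Hz figure is easier to weigh against per-step timings when converted into a time budget. The short calculation below does that conversion. The 30 Hz sensor rate is an assumption for illustration (typical of Kinect-class RGBD cameras), not a number taken from the paper.

```python
def frame_budget_ms(rate_hz: float) -> float:
    """Time available per map update, in milliseconds, at a given rate."""
    return 1000.0 / rate_hz


REPORTED_RATE_HZ = 10.7   # from the abstract
ASSUMED_SENSOR_HZ = 30.0  # assumption: typical RGBD camera frame rate

budget = frame_budget_ms(REPORTED_RATE_HZ)
frames_per_update = ASSUMED_SENSOR_HZ / REPORTED_RATE_HZ

print(f"{budget:.1f} ms per map update")                     # 93.5 ms per map update
print(f"{frames_per_update:.1f} sensor frames per update")   # 2.8 sensor frames per update
```

Under that assumption, segmentation, fusion, and clustering together must fit in roughly 93 ms, and the pipeline keeps up with about one in three sensor frames.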
Where Pith is reading between the lines
- The same coherent-region idea could be tested on other 3D sensors or SLAM pipelines if the initial segmentation step can be adapted.
- Treating the 3D map as the discovery primitive may reduce sensitivity to per-frame viewpoint changes compared with 2D-only methods.
- Integrating an online model update step after discovery could close the loop between segmentation and class learning.
Load-bearing premise
The initial per-frame segmentation of known classes is accurate enough that any unlabeled coherent region in the 3D map reliably represents a distinct semantic object rather than noise or a partial view.
What would settle it
Running the method on NYUDv2 and finding that a large fraction of the discovered coherent regions merge multiple objects, split single objects, or fail to separate known from unseen classes would falsify the claim.
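One way to make this test concrete is to count merges and splits directly against ground-truth instances. The sketch below is an illustrative protocol, not one taken from the paper: it assumes per-point region ids and ground-truth object ids are available, and the `min_share` cutoff (used to ignore a few stray points) is a hypothetical parameter.

```python
from collections import Counter, defaultdict


def merge_split_fractions(region_ids, object_ids, min_share=0.1):
    """Return (fraction of regions that merge objects,
               fraction of objects that are split across regions).

    region_ids[i] and object_ids[i] describe the same 3D point. A region
    "merges" when at least two objects each hold min_share of its points;
    an object is "split" when at least two regions each hold min_share of
    its points.
    """
    by_region = defaultdict(Counter)
    by_object = defaultdict(Counter)
    for region, obj in zip(region_ids, object_ids):
        by_region[region][obj] += 1
        by_object[obj][region] += 1
    if not by_region:
        return 0.0, 0.0

    def spans_many(counter):
        total = sum(counter.values())
        significant = [k for k, n in counter.items() if n / total >= min_share]
        return len(significant) > 1

    merged = sum(spans_many(c) for c in by_region.values()) / len(by_region)
    split = sum(spans_many(c) for c in by_object.values()) / len(by_object)
    return merged, split


regions = [0, 0, 0, 0, 1, 1, 2, 2]
objects = ["a", "a", "b", "b", "c", "c", "c", "c"]
print(merge_split_fractions(regions, objects))
```

In this toy input, region 0 merges objects a and b, and object c is split across regions 1 and 2, so one third of the regions merge and one third of the objects are split.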
Original abstract
This work addresses the task of open world semantic segmentation using RGBD sensing to discover new semantic classes over time. Although there are many types of objects in the real-word, current semantic segmentation methods make a closed world assumption and are trained only to segment a limited number of object classes. Towards a more open world approach, we propose a novel method that incrementally learns new classes for image segmentation. The proposed system first segments each RGBD frame using both color and geometric information, and then aggregates that information to build a single segmented dense 3D map of the environment. The segmented 3D map representation is a key component of our approach as it is used to discover new object classes by identifying coherent regions in the 3D map that have no semantic label. The use of coherent region in the 3D map as a primitive element, rather than traditional elements such as surfels or voxels, also significantly reduces the computational complexity and memory use of our method. It thus leads to semi-real-time performance at {10.7}Hz when incrementally updating the dense 3D map at every frame. Through experiments on the NYUDv2 dataset, we demonstrate that the proposed method is able to correctly cluster objects of both known and unseen classes. We also show the quantitative comparison with the state-of-the-art supervised methods, the processing time of each step, and the influences of each component.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an incremental open-world semantic segmentation pipeline for RGBD data. Per-frame segmentation combines color and geometric cues; information is fused into a dense 3D map whose unlabeled coherent regions are treated as candidate instances of unseen classes. The 3D coherent-region primitive is claimed to reduce complexity, enabling 10.7 Hz incremental updates. Experiments on NYUDv2 are said to demonstrate correct clustering of both known and unseen classes together with runtime breakdowns and comparisons against supervised baselines.
Significance. If the discovery mechanism proves reliable, the work would contribute a practical route toward open-world mapping in robotics without requiring retraining for every new class. The efficiency argument for coherent 3D regions over surfels or voxels is a concrete engineering contribution, and the reported semi-real-time rate is a verifiable strength. The absence of quantitative metrics for the unseen-class clustering step, however, prevents a full assessment of impact.
Major comments (2)
- [Abstract / Experiments] Abstract and Experiments: the central claim that the method 'is able to correctly cluster objects of both known and unseen classes' is presented without any reported quantitative metric (purity, ARI, region-to-instance IoU, or similar) on NYUDv2 held-out classes; only qualitative demonstration, runtime, and supervised comparisons are mentioned.
- [Method] Method (coherent-region identification): the assumption that geometrically coherent unlabeled regions after known-class fusion correspond one-to-one with distinct semantic objects is load-bearing, yet no definition of coherence (connectivity rule, geometric threshold, handling of partial views) or ablation on fragmentation/aggregation artifacts is supplied.
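Two of the metrics named in the first major comment, purity and the adjusted Rand index (ARI), can be computed from cluster assignments alone. The sketch below uses only the standard library and the textbook definitions; the toy labels are invented for illustration and say nothing about the paper's actual results.

```python
from collections import Counter
from math import comb


def purity(pred, truth):
    """Fraction of points that fall in their cluster's majority class."""
    pairs = Counter(zip(pred, truth))
    best = Counter()
    for (cluster, _), n in pairs.items():
        best[cluster] = max(best[cluster], n)
    return sum(best.values()) / len(pred)


def adjusted_rand_index(pred, truth):
    """Pair-counting agreement between two partitions, corrected for chance."""
    n = len(pred)
    if n < 2:
        return 1.0
    sum_cells = sum(comb(c, 2) for c in Counter(zip(pred, truth)).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred).values())
    sum_truth = sum(comb(c, 2) for c in Counter(truth).values())
    expected = sum_pred * sum_truth / comb(n, 2)
    max_index = (sum_pred + sum_truth) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


# Toy example: class "b" is over-segmented into clusters 1 and 2.
pred = [0, 0, 1, 1, 2, 2]
truth = ["a", "a", "b", "b", "b", "b"]
print(purity(pred, truth))                          # 1.0
print(round(adjusted_rand_index(pred, truth), 3))   # 0.444
```

The example shows why reporting both matters: purity stays at 1.0 when an object is fragmented into several pure clusters, while ARI drops, so purity alone would not expose the fragmentation artifacts raised in the second comment.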
Minor comments (3)
- [Abstract] Abstract: 'real-word' should read 'real-world'.
- [Abstract] Abstract: the notation '{10.7}Hz' contains extraneous braces; write 10.7 Hz.
- [Abstract / Experiments] Abstract: the phrase 'influences of each component' is vague; the corresponding experimental subsection should list the exact ablations performed.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The two major comments identify important gaps in quantitative evaluation and methodological detail. We address each below and will revise the manuscript to incorporate the requested additions.
- Referee: [Abstract / Experiments] Abstract and Experiments: the central claim that the method 'is able to correctly cluster objects of both known and unseen classes' is presented without any reported quantitative metric (purity, ARI, region-to-instance IoU, or similar) on NYUDv2 held-out classes; only qualitative demonstration, runtime, and supervised comparisons are mentioned.
  Authors: We agree this is a substantive omission. The manuscript currently supports the clustering claim only with qualitative examples on NYUDv2. In revision we will add quantitative metrics (ARI, purity, and region-to-instance IoU) computed on held-out unseen classes, together with the corresponding experimental protocol.
  Revision: yes
- Referee: [Method] Method (coherent-region identification): the assumption that geometrically coherent unlabeled regions after known-class fusion correspond one-to-one with distinct semantic objects is load-bearing, yet no definition of coherence (connectivity rule, geometric threshold, handling of partial views) or ablation on fragmentation/aggregation artifacts is supplied.
  Authors: The current text describes coherent regions at a high level but does not supply the requested formal definition or ablation. We will expand the method section with the exact connectivity rule, geometric thresholds, partial-view handling, and an ablation study quantifying fragmentation and aggregation effects.
  Revision: yes
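For readers wondering what a "connectivity rule" with a "geometric threshold" might look like in practice, here is a minimal sketch of one possible definition. It is not the paper's rule, which the report notes is unspecified: the 4-connected grid, the depth-difference test, and the `max_step` value are all assumptions chosen for illustration, and the input is assumed to be a non-empty rectangular depth grid in metres.

```python
from collections import deque


def grow_regions(depth, max_step=0.05):
    """Label 4-connected cells whose neighbouring depths differ by at most
    max_step. Returns a grid of integer region labels, numbered from 0 in
    scan order."""
    rows, cols = len(depth), len(depth[0])
    labels = [[-1] * cols for _ in range(rows)]
    next_label = 0
    for r0 in range(rows):
        for c0 in range(cols):
            if labels[r0][c0] != -1:
                continue  # already absorbed into an earlier region
            labels[r0][c0] = next_label
            queue = deque([(r0, c0)])
            while queue:  # breadth-first flood fill from the seed cell
                r, c = queue.popleft()
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and labels[nr][nc] == -1
                            and abs(depth[nr][nc] - depth[r][c]) <= max_step):
                        labels[nr][nc] = next_label
                        queue.append((nr, nc))
            next_label += 1
    return labels


depth = [
    [1.00, 1.01, 2.00],
    [1.02, 1.03, 2.01],
]
print(grow_regions(depth))  # [[0, 0, 1], [0, 0, 1]]
```

Even this toy rule makes the referee's concern tangible: a larger `max_step` merges the two surfaces into one region, and a smaller one fragments a gently sloping surface, so the threshold choice directly drives the merge and split artifacts an ablation would need to quantify.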
Circularity Check
No circularity: procedural method with no derivations or fitted predictions
The paper describes a systems-level pipeline that segments RGBD frames, aggregates them into a dense 3D map, and identifies unlabeled coherent regions for new class discovery. No equations, parameter fits, or predictions appear in the provided text. The method relies on standard segmentation and geometric coherence steps without self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations. Experiments on NYUDv2 provide external validation rather than internal equivalence to inputs. The derivation chain is therefore self-contained and non-circular.