arxiv: 2604.11007 · v1 · submitted 2026-04-13 · 💻 cs.CV

Data-Efficient Semantic Segmentation of 3D Point Clouds via Open-Vocabulary Image Segmentation-based Pseudo-Labeling

Takahiko Furuya

Pith reviewed 2026-05-10 15:02 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D point cloud segmentation · data-efficient learning · pseudo-labeling · open-vocabulary segmentation · weakly supervised learning · semantic segmentation · ScanNet · S3DIS

The pith

PLOVIS trains 3D point cloud segmentation models effectively by rendering 2D images and generating pseudo-labels with open-vocabulary segmentation, even with only tens of scenes and under 100 annotations per scene.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PLOVIS to handle three simultaneous shortages in 3D semantic segmentation: few training scenes, very sparse point labels, and no original 2D image sequences. It renders 2D images directly from the available 3D point clouds, runs them through an open-vocabulary image segmentation model to create pseudo labels, then cleans those labels with a two-stage filter that drops low-confidence and likely incorrect ones. A class-balanced memory bank stores examples to stabilize training despite imbalance. On four standard benchmarks the method beats both ordinary fine-tuning and current weakly supervised techniques when data is restricted to realistic low amounts. A reader would care because collecting dense 3D annotations remains expensive, so techniques that work with minimal labels could expand where semantic segmentation is usable.

Core claim

By rendering 2D images straight from the training 3D point clouds and feeding them to an open-vocabulary image segmentation model to obtain pseudo labels, then refining those labels through two-stage filtering and a class-balanced memory bank, PLOVIS produces a usable training signal that lets standard 3D segmentation networks reach competitive accuracy despite the joint absence of dense scenes, dense point labels, and source 2D sequences.

What carries the argument

The PLOVIS pipeline that renders 2D images from 3D point clouds, passes them to an open-vocabulary segmentation model for initial pseudo labels, applies two-stage filtering to remove noisy labels, and uses a class-balanced memory bank to train the 3D segmentation network.
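
The render-then-back-project step can be pictured with a minimal sketch. It assumes a pinhole camera, points already expressed in the camera frame, and one pixel per point with no splatting. The names `render_points` and `backproject_labels`, the grey background, and the `focal` parameter are illustrative assumptions, not details taken from the paper.

```python
import math


def render_points(points, colors, width, height, focal, background=(128, 128, 128)):
    """Project camera-frame points (x, y, z; z > 0 is in front of the camera)
    onto an image using a pinhole model and a z-buffer.

    Returns (image, index_map): image[v][u] is an RGB tuple and
    index_map[v][u] is the index of the nearest point at that pixel, or -1.
    """
    image = [[background] * width for _ in range(height)]
    index_map = [[-1] * width for _ in range(height)]
    depth = [[math.inf] * width for _ in range(height)]
    cx, cy = width / 2.0, height / 2.0
    for i, (x, y, z) in enumerate(points):
        if z <= 0:
            continue  # point is behind the camera
        u = int(math.floor(focal * x / z + cx))
        v = int(math.floor(focal * y / z + cy))
        # Keep only the closest point per pixel (z-buffer test).
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z
            image[v][u] = colors[i]
            index_map[v][u] = i
    return image, index_map


def backproject_labels(index_map, label_map, num_points, ignore=-1):
    """Copy per-pixel labels back to the points visible at those pixels."""
    point_labels = [ignore] * num_points
    for index_row, label_row in zip(index_map, label_map):
        for idx, label in zip(index_row, label_row):
            if idx >= 0:
                point_labels[idx] = label
    return point_labels
```

Keeping `index_map` alongside the image is what allows labels predicted on pixels to be copied back to individual 3D points; in a real pipeline the `label_map` would come from the open-vocabulary segmentation model, and occluded or off-screen points simply stay unlabeled for that view.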

If this is right

3D segmentation training becomes possible without any paired 2D image sequences from the original capture.
Models can be trained successfully when each scene supplies fewer than 100 annotated points across only a few dozen scenes.
The two-stage filter and memory bank together reduce the impact of noisy and imbalanced pseudo labels enough to outperform standard fine-tuning and existing weakly supervised baselines.
The same performance advantage appears on ScanNet, S3DIS, Toronto3D, and Semantic3D under identical low-data regimes.
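
One plausible reading of a class-balanced memory bank is sketched below, under the assumption that it keeps a bounded queue of recent examples per class and draws the same number from each class. The summary above does not specify capacity, what is stored, or the sampling rule, so `ClassBalancedMemoryBank` and its parameters are hypothetical.

```python
import random
from collections import defaultdict, deque


class ClassBalancedMemoryBank:
    """Bounded per-class storage with equal-count sampling (illustrative only)."""

    def __init__(self, capacity_per_class, seed=0):
        # Each class gets its own FIFO queue; old entries are evicted first.
        self._queues = defaultdict(lambda: deque(maxlen=capacity_per_class))
        self._rng = random.Random(seed)

    def add(self, label, item):
        self._queues[label].append(item)

    def sample(self, per_class):
        """Draw `per_class` items from every stored class, with replacement,
        so that rare classes contribute as many items as frequent ones."""
        batch = []
        for label in sorted(self._queues):
            queue = self._queues[label]
            picks = [self._rng.choice(queue) for _ in range(per_class)]
            batch.extend((label, item) for item in picks)
        return batch
```

Sampling with replacement is a design choice made here so that a class with a single stored example can still fill its share of the batch.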

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The rendering-plus-open-vocabulary strategy might be adapted to other 3D tasks such as instance segmentation or object detection where similar pseudo-label noise patterns occur.
Further gains could come from pairing the method with newer open-vocabulary models that better preserve 3D-projected geometry.
One could test whether the same pipeline works when point clouds come from different sensors or when annotation budgets drop below 50 points per scene.

Load-bearing premise

That 2D images rendered directly from sparse 3D point clouds retain enough geometric fidelity for an open-vocabulary image segmentation model to produce pseudo labels whose noise can be adequately removed by the two-stage filter and memory bank.

What would settle it

A controlled test that measures the fraction of pseudo labels that match ground-truth annotations on the rendered training images, both before and after the two-stage filter, on a held-out portion of the benchmark scenes.
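
Such a test reduces to two numbers per filtering stage: how many labels survive, and how many of the survivors are correct. A minimal sketch, assuming labels are integer class ids and that `-1` marks a discarded label (both conventions are assumptions made for illustration):

```python
def pseudo_label_precision(pseudo_labels, ground_truth, ignore=-1):
    """Return (precision, coverage) for a list of pseudo labels.

    precision: fraction of kept labels that match the ground truth.
    coverage:  fraction of all positions that still carry a label.
    """
    kept = [(p, g) for p, g in zip(pseudo_labels, ground_truth) if p != ignore]
    coverage = len(kept) / len(pseudo_labels) if pseudo_labels else 0.0
    if not kept:
        return 0.0, coverage
    correct = sum(1 for p, g in kept if p == g)
    return correct / len(kept), coverage
```

Calling the function once on the raw pseudo labels and once on the filtered ones shows whether the filter buys precision and how much coverage it costs to do so.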

Original abstract

Semantic segmentation of 3D point cloud scenes is a crucial task for various applications. In real-world scenarios, training segmentation models often faces three concurrent forms of data insufficiency: scarcity of training scenes, scarcity of point-level annotations, and absence of 2D image sequences from which point clouds were reconstructed. Existing data-efficient algorithms typically address only one or two of these challenges, leaving the joint treatment of all three unexplored. This paper proposes a data-efficient training framework specifically designed to address the three forms of data insufficiency. Our proposed algorithm, called Point pseudo-Labeling via Open-Vocabulary Image Segmentation (PLOVIS), leverages an Open-Vocabulary Image Segmentation (OVIS) model as a pseudo label generator to compensate for the lack of training data. PLOVIS creates 2D images for pseudo-labeling directly from training 3D point clouds, eliminating the need for 2D image sequences. To mitigate the inherent noise and class imbalance in pseudo labels, we introduce a two-stage filtering of pseudo labels combined with a class-balanced memory bank for effective training. The two-stage filtering mechanism first removes low-confidence pseudo labels, then discards likely incorrect pseudo labels, thereby enhancing the quality of pseudo labels. Experiments on four benchmark datasets, i.e., ScanNet, S3DIS, Toronto3D, and Semantic3D, under realistic data-scarce conditions (a few tens of training 3D scenes, each annotated with only <100 3D points) demonstrate that PLOVIS consistently outperforms existing methods including standard fine-tuning strategies and state-of-the-art weakly supervised learning algorithms. Code will be made publicly available.
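
The two-stage filtering described in the abstract can be sketched as follows. Stage one is a plain confidence threshold. The abstract does not say how "likely incorrect" labels are detected, so stage two here uses a stand-in criterion, disagreement with the 3D network's current prediction, purely as an assumption. The function name, the threshold value, and the `IGNORE` marker are likewise hypothetical.

```python
IGNORE = -1  # marker for points that carry no usable pseudo label


def two_stage_filter(pseudo_labels, confidences, model_predictions, min_confidence=0.5):
    """Return pseudo labels with rejected entries replaced by IGNORE.

    Stage 1 drops labels whose confidence is below `min_confidence`.
    Stage 2 (assumed criterion) drops labels that disagree with the
    3D network's own prediction for the same point.
    """
    filtered = []
    for label, confidence, prediction in zip(pseudo_labels, confidences, model_predictions):
        if label == IGNORE or confidence < min_confidence:
            filtered.append(IGNORE)  # stage 1: low confidence
        elif prediction != label:
            filtered.append(IGNORE)  # stage 2: likely incorrect (stand-in test)
        else:
            filtered.append(label)
    return filtered
```

Running the stages in this order means the second, more expensive check only has to consider labels that already passed the cheap confidence test.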

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PLOVIS renders point clouds to 2D views for OVIS pseudo-labeling, then applies two-stage filtering and a memory bank. This jointly tackles the three data shortages, but it rests on whether those rendered views still yield usable image segments.

The core idea here is to generate pseudo labels for 3D point cloud segmentation by rendering the point clouds into 2D images and feeding them to an open-vocabulary image segmentation model, then cleaning the output with a two-stage filter and a class-balanced memory bank. This setup is meant to work when you have only a few dozen scenes, fewer than 100 labeled points per scene, and no original 2D image sequences at all. The paper reports this pipeline beating standard fine-tuning and existing weakly supervised baselines on ScanNet, S3DIS, Toronto3D, and Semantic3D under those constraints.

That joint coverage of all three scarcity types is the main new piece relative to prior work, which usually fixes only one or two of them. The rendering step itself is a simple way to sidestep the missing 2D data problem, and the filter plus memory bank are practical additions for dealing with noisy and imbalanced labels. The experiments are run on standard benchmarks with realistic low-data regimes, which makes the claims testable once the promised code appears.

The soft spot is the rendering quality. The fewer-than-100 figure counts labeled points, not the points in each cloud, so the renderings are denser than that number suggests. Even so, views rendered from point clouds can contain holes and lack surface cues, and an OVIS model trained on natural photos may produce pseudo labels that are biased or incomplete in ways the two-stage filter cannot fully repair. If the gains shrink or disappear when the rendering parameters change, the data-efficiency argument weakens.

This is aimed at people building 3D perception systems for robotics or mapping who cannot afford dense annotations. A reader already working on weakly supervised 3D segmentation would get a concrete pipeline worth trying or extending. It deserves peer review because the problem is common, the method is clearly specified, and the empirical claims can be checked against the numbers and ablations in the full paper. I would send it to referees with a request for more detail on the projection process and failure cases.

Referee Report

2 major / 1 minor

Summary. The paper proposes PLOVIS, a data-efficient framework for 3D point cloud semantic segmentation that handles scarcity of training scenes, sparse point annotations (<100 per scene), and lack of original 2D image sequences. It renders 2D images directly from the 3D point clouds, applies an open-vocabulary image segmentation (OVIS) model to generate pseudo-labels, employs a two-stage filtering process (low-confidence removal followed by likely-incorrect label discarding) to mitigate noise, and uses a class-balanced memory bank during training. Experiments on ScanNet, S3DIS, Toronto3D, and Semantic3D under realistic low-data regimes claim consistent outperformance over standard fine-tuning and state-of-the-art weakly supervised methods.

Significance. If the empirical results hold, the work is significant for addressing the joint challenges of multiple data insufficiencies in real-world 3D segmentation, a scenario not fully explored by prior methods. It innovatively transfers 2D OVIS capabilities to 3D without needing dense reconstructions or original images. The planned public code release is a clear strength, supporting reproducibility and further validation of the pipeline.

major comments (2)

[Method] The rendering of 2D images from sparse point clouds (in scenes carrying <100 annotated points each) is load-bearing for the central claim, as it directly determines whether OVIS can produce usable pseudo-labels. The method section should explicitly describe the projection/rendering procedure (including any handling of empty regions or point coloring) and provide qualitative examples of rendered views with corresponding OVIS outputs to address whether geometric fidelity is retained.
[Experiments] Table(s) reporting quantitative results: the outperformance claim on four datasets requires specific mIoU (or equivalent) numbers, standard deviations across runs, and direct comparisons to fine-tuning and weakly-supervised baselines. Without these, and without ablations isolating the two-stage filter and memory bank, the data-efficiency argument cannot be fully assessed.

minor comments (1)

[Abstract] The abstract would benefit from including one or two key quantitative results (e.g., average mIoU improvement) to make the empirical contribution immediately clear.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the paper.

Referee: [Method] The rendering of 2D images from sparse point clouds (in scenes carrying <100 annotated points each) is load-bearing for the central claim, as it directly determines whether OVIS can produce usable pseudo-labels. The method section should explicitly describe the projection/rendering procedure (including any handling of empty regions or point coloring) and provide qualitative examples of rendered views with corresponding OVIS outputs to address whether geometric fidelity is retained.

Authors: We agree that greater explicitness on the rendering step is warranted given the sparsity. In the revised manuscript we will expand the method section to detail the projection procedure (orthographic or perspective projection from chosen viewpoints, with point size and splatting parameters), handling of empty regions (background pixels assigned a fixed neutral color with no depth value), and point coloring (direct use of available RGB attributes or default intensity mapping). We will also add a new figure with side-by-side qualitative examples of rendered 2D views and the corresponding OVIS pseudo-label maps to demonstrate that geometric structure remains sufficiently preserved for usable pseudo-labels. revision: yes
Referee: [Experiments] Table(s) reporting quantitative results: the outperformance claim on four datasets requires specific mIoU (or equivalent) numbers, standard deviations across runs, and direct comparisons to fine-tuning and weakly-supervised baselines. Without these, and without ablations isolating the two-stage filter and memory bank, the data-efficiency argument cannot be fully assessed.

Authors: The manuscript already contains tables with mIoU results on ScanNet, S3DIS, Toronto3D, and Semantic3D under the stated low-data regimes, including comparisons against fine-tuning and several weakly-supervised baselines. To address the concern we will (i) explicitly tabulate all numerical mIoU values with standard deviations obtained from three independent runs, (ii) ensure every baseline is listed in the same table for direct comparison, and (iii) add a dedicated ablation table that isolates the two-stage filter and the class-balanced memory bank while keeping all other components fixed. These additions will be included in the revised version. revision: yes
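
For reference, the reporting the referee asks for, mIoU per run plus mean and standard deviation across runs, can be computed as in the sketch below. It assumes integer class ids, skips classes absent from both prediction and target, and uses the sample standard deviation; these conventions and the function names are assumptions, not the paper's evaluation protocol.

```python
from statistics import mean, stdev


def mean_iou(predictions, targets, num_classes):
    """Mean intersection-over-union over classes that appear in either list."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(predictions, targets) if p == c and t == c)
        union = sum(1 for p, t in zip(predictions, targets) if p == c or t == c)
        if union > 0:  # ignore classes absent from both prediction and target
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


def summarize_runs(miou_per_run):
    """Return (mean, sample standard deviation) over independent runs."""
    return mean(miou_per_run), stdev(miou_per_run)
```

With only three runs, as the rebuttal proposes, the standard deviation is a rough estimate, so small gaps between methods should be read with that in mind.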

Circularity Check

0 steps flagged

No circularity: empirical pipeline with external components and benchmark validation

The paper presents PLOVIS as a practical pipeline: render 2D views from sparse 3D point clouds, apply an external open-vocabulary image segmentation model to generate pseudo-labels, then use two-stage filtering plus a class-balanced memory bank for training a 3D segmenter. All performance claims rest on direct experimental comparisons against fine-tuning and weakly-supervised baselines on ScanNet, S3DIS, Toronto3D, and Semantic3D under the stated data-scarce regimes. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation; the method is self-contained against external benchmarks and does not reduce its results to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the unproven assumption that rendered 2D views preserve sufficient information for reliable OVIS pseudo-labeling and that the proposed filters remove noise without introducing new bias; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

domain assumption: Rendered 2D projections from 3D point clouds retain enough semantic cues for open-vocabulary 2D models to produce usable pseudo-labels.
Invoked when the method creates 2D images directly from training 3D point clouds for pseudo-labeling.
domain assumption: Two-stage filtering plus class-balanced memory bank sufficiently mitigates noise and imbalance in pseudo-labels.
Central to the training procedure described in the abstract.
