pith. machine review for the scientific record.

arxiv: 2605.07356 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: no theorem link

UniD-Shift: Towards Unified Semantic Segmentation via Interpretable Share-Private Multimodal Decomposition

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords: semantic segmentation · multimodal fusion · 2D-3D segmentation · point cloud · feature decomposition · share-private subspaces · distribution shift · autonomous driving

The pith

Decomposing 2D image and 3D point cloud features into shared semantic and private modality-specific subspaces unifies cross-modal fusion and improves accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to unify semantic segmentation across 2D camera images and 3D LiDAR point clouds by breaking extracted features into shared components that capture semantics common to both and private components that keep each sensor's unique traits. This addresses the practical difficulty of aligning sparse 3D scans with distorted 2D views in settings like autonomous driving and urban modeling. The framework pairs a SAM-based 2D encoder with an SPTNet-based 3D encoder, performs the explicit decomposition, fuses only the shared part via lightweight attention, and trains with regularization to enforce alignment plus subspace independence. If successful, the result is higher segmentation accuracy on large benchmarks together with stable behavior when test data comes from different geographic regions.
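
As a rough illustration of the mechanism described above, the following PyTorch sketch shows one way a share-private decomposition with shared-only attention fusion could be wired together. The module names, feature dimensions, and the single cross-attention layer are editorial assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of the share-private decomposition and shared-feature fusion
# pattern described above. Layer choices and dimensions are illustrative
# assumptions, not the architecture reported in the paper.
import torch
import torch.nn as nn


class SharePrivateDecomposer(nn.Module):
    """Projects one modality's features into a shared and a private subspace."""

    def __init__(self, in_dim: int, sub_dim: int):
        super().__init__()
        self.to_shared = nn.Linear(in_dim, sub_dim)   # common semantic factors
        self.to_private = nn.Linear(in_dim, sub_dim)  # modality-specific factors

    def forward(self, feats: torch.Tensor):
        return self.to_shared(feats), self.to_private(feats)


class SharedAttentionFusion(nn.Module):
    """Aggregates the two shared components with lightweight cross-attention."""

    def __init__(self, sub_dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(sub_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(sub_dim)

    def forward(self, shared_3d: torch.Tensor, shared_2d: torch.Tensor):
        # 3D points query the 2D shared features; the residual keeps the 3D stream intact.
        fused, _ = self.attn(query=shared_3d, key=shared_2d, value=shared_2d)
        return self.norm(shared_3d + fused)


if __name__ == "__main__":
    B, N_pts, N_pix, d2, d3, d = 2, 4096, 1024, 256, 128, 96
    img_feats = torch.randn(B, N_pix, d2)   # stand-in for SAM-based 2D features
    pts_feats = torch.randn(B, N_pts, d3)   # stand-in for SPTNet-based 3D features

    dec_2d, dec_3d = SharePrivateDecomposer(d2, d), SharePrivateDecomposer(d3, d)
    s2d, p2d = dec_2d(img_feats)
    s3d, p3d = dec_3d(pts_feats)

    fused = SharedAttentionFusion(d)(s3d, s2d)    # (B, N_pts, d)
    per_point = torch.cat([fused, p3d], dim=-1)   # shared + private feed the 3D head
    print(per_point.shape)
```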

Core claim

We present UniD-Shift, a unified multimodal framework for joint 2D-3D semantic segmentation. Features from a SAM-based vision encoder and an SPTNet-based geometric encoder are decomposed into shared subspaces that summarize common semantic factors and private subspaces that preserve modality-specific properties. A lightweight attention-based fusion module aggregates the shared features, while a regularized training objective enforces semantic alignment and subspace independence, producing improved segmentation on SemanticKITTI and nuScenes benchmarks along with strong cross-domain generalization on nuScenes USA-Singapore.
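
The claim leans on a regularized objective that enforces semantic alignment and subspace independence. The paper's exact losses are not reproduced here; the sketch below shows one common instantiation, assuming a cosine alignment term on paired shared features and a cross-covariance penalty between shared and private components, with illustrative weights.

```python
# Hedged sketch of a regularized objective of the kind described: a segmentation
# loss plus alignment and subspace-independence terms. The specific losses and
# weights below are editorial assumptions, not the paper's exact objective.
import torch
import torch.nn.functional as F


def alignment_loss(shared_3d, shared_2d):
    """Pulls paired shared features together (cosine alignment on matched point/pixel pairs)."""
    return (1.0 - F.cosine_similarity(shared_3d, shared_2d, dim=-1)).mean()


def independence_loss(shared, private):
    """Penalizes cross-covariance between shared and private subspaces (soft orthogonality)."""
    s = shared - shared.mean(dim=1, keepdim=True)
    p = private - private.mean(dim=1, keepdim=True)
    cross_cov = torch.bmm(s.transpose(1, 2), p) / s.shape[1]  # (B, d, d)
    return cross_cov.pow(2).mean()


def total_loss(seg_loss, s3d, s2d, p3d, p2d, lam_align=0.1, lam_indep=0.01):
    return (seg_loss
            + lam_align * alignment_loss(s3d, s2d)
            + lam_indep * (independence_loss(s3d, p3d) + independence_loss(s2d, p2d)))
```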

What carries the argument

The interpretable share-private multimodal decomposition, which separates common semantic factors from modality-unique properties to support cross-modal alignment and fusion.

If this is right

  • Segmentation accuracy rises consistently over representative multimodal baselines on SemanticKITTI and nuScenes.
  • Performance remains stable under distribution shifts in cross-domain evaluation on nuScenes USA-Singapore.
  • Computational efficiency stays competitive because the fusion module is kept lightweight.
  • The explicit decomposition yields interpretable separation between shared semantics and modality-specific details.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same share-private split could be tested on additional sensor pairs such as radar and camera to check whether common semantics transfer across more modalities.
  • Visual inspection of the learned shared subspace might identify which object categories align most reliably between 2D and 3D, informing sensor placement choices.
  • If private components turn out to be indispensable, the work implies that fully shared multimodal representations have inherent limits for segmentation tasks.
  • The regularization for subspace independence could be reused in other multimodal settings where alignment and uniqueness must both be preserved.

Load-bearing premise

That features learned from 2D images and 3D point clouds contain common semantic content that can be cleanly separated into independent shared and private subspaces without losing essential information.

What would settle it

An ablation experiment that removes the share-private decomposition, replaces it with direct feature concatenation, and shows no accuracy gain or a performance drop on the SemanticKITTI or nuScenes validation sets.
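
A minimal sketch of that control condition follows, assuming the same encoders and segmentation head are kept and only the fusion step changes; the projection width and layer choice are illustrative, not the authors' specification.

```python
# Sketch of the ablation baseline described above: keep the same encoders and
# segmentation head, but replace the share-private decomposition with direct
# concatenation of the full 2D and 3D features. Dimensions are assumptions.
import torch
import torch.nn as nn


class ConcatFusionBaseline(nn.Module):
    def __init__(self, dim_2d: int, dim_3d: int, out_dim: int):
        super().__init__()
        # Single projection over concatenated features; no shared/private split,
        # no alignment or independence regularizers.
        self.proj = nn.Sequential(nn.Linear(dim_2d + dim_3d, out_dim), nn.ReLU())

    def forward(self, feats_2d: torch.Tensor, feats_3d: torch.Tensor):
        # feats_2d is assumed already sampled at the projected point locations,
        # so both tensors are (B, N_points, C).
        return self.proj(torch.cat([feats_2d, feats_3d], dim=-1))
```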

Figures

Figures reproduced from arXiv: 2605.07356 by Jing Ou, Shuai Zhang, Tengxi Wang, Wufan Zhao, Yuan Liu, Zhecheng Shi, Zhuxiao Li.

Figure 1. Existing multimodal fusion strategies (left) often mix …
Figure 2. Overall architecture of the proposed UniD-Shift. The model takes synchronized 2D images and 3D point clouds as inputs. The 3D branch employs a SPTNet backbone to extract hierarchical geometric features, while the 2D branch utilizes a SAM-based encoder to obtain semantically enriched visual representations. Both modalities are decomposed into shared and private components, then feed into the Shared Attentio…
Figure 3. Architecture of the proposed shared-private feature de…
Figure 4. Visualization of object segmentation performance on the …
Figure 5. Visualization of object segmentation performance on the …
Figure 6. Representative failure cases on the nuScenes and Se…
Figure 7. More visualization of object segmentation performance …
Figure 9. More visualizations comparing object segmentation performance with other methods on the nuScenes validation set.
Figure 10. More visualizations comparing object segmentation performance with other methods on the SemanticKitti validation set.
Original abstract

Semantic segmentation of large-scale 3D point clouds is crucial for applications such as autonomous driving and urban digital twins. However, the sparse sampling pattern of LiDAR and the view-dependent geometric distortion in image observations complicate cross-modal alignment and hinder stable fusion. Inspired by the fact that 2D images captured by cameras are representations of the 3D world, we recognize that the features learned from 2D and 3D segmentation share some common semantics, while other aspects remain modality-specific. This insight motivates a unified multimodal framework for joint 2D-3D semantic segmentation. We combine a SAM-based vision encoder with a SPTNet-based geometric encoder to extract complementary semantic and geometric representations. The resulting features from both modalities are explicitly decomposed into shared and private subspaces, where the shared components summarize semantic factors common to both domains, and the private components preserve properties that are unique to each modality. A lightweight attention-based fusion module aggregates the shared features into a consistent cross-modal representation, and a regularized training objective ensures both semantic alignment and subspace independence. Experiments on the SemanticKITTI and nuScenes benchmarks demonstrate consistent improvements in segmentation accuracy over representative multimodal baselines, accompanied by competitive computational efficiency. Cross-domain evaluation on nuScenes USA-Singapore shows stable performance under distribution shifts, demonstrating strong generalization. The implementation code is publicly available at: https://github.com/shuaizhang69/UniD-Shift.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces UniD-Shift, a unified 2D-3D semantic segmentation framework that extracts features via a SAM-based vision encoder and SPTNet-based geometric encoder, explicitly decomposes them into shared (common semantics) and private (modality-specific) subspaces, applies a lightweight attention-based fusion module on the shared components, and optimizes with a regularized objective enforcing alignment and subspace independence. It reports consistent accuracy gains over multimodal baselines on SemanticKITTI and nuScenes, competitive efficiency, and stable cross-domain performance on nuScenes USA-Singapore splits, with public code released.

Significance. If the share-private decomposition can be shown to drive the reported gains and to produce genuinely independent subspaces, the work would provide a principled, interpretable mechanism for cross-modal fusion that addresses view-dependent distortions and sparse sampling in LiDAR-image pairs. The public implementation and cross-domain stability results would be concrete strengths for reproducibility and generalization claims in autonomous-driving segmentation.

major comments (3)
  1. [Experiments] Experiments section: the manuscript reports accuracy improvements over representative multimodal baselines but provides no ablation that removes the share-private decomposition (or the independence regularizer) while retaining the same SAM/SPTNet encoders and attention aggregator. Without this isolation, gains cannot be attributed to the proposed mechanism rather than backbone choice or fusion architecture, which is load-bearing for the central claim.
  2. [§3.2] §3.2 (decomposition and objective): no quantitative diagnostics are reported to verify that the learned subspaces actually separate as intended (e.g., subspace correlation, mutual information between shared and private components, or reconstruction fidelity after decomposition). The claim of “interpretable” and “independent” subspaces therefore rests on the regularizer alone without empirical confirmation.
  3. [Cross-domain evaluation] Cross-domain evaluation (nuScenes USA-Singapore): the stability result is presented as evidence of strong generalization, yet the paper does not analyze whether the shared subspace remains consistent across domains or whether private components absorb the shift; this leaves the mechanism’s contribution to robustness unverified.
minor comments (2)
  1. [§1] The abstract and §1 state that 2D and 3D features “share some common semantics” but do not cite prior empirical evidence or provide a motivating figure; a short related-work paragraph or illustrative example would strengthen the motivation.
  2. [§3] Notation for the shared/private projections and the attention aggregator is introduced without an explicit table of symbols; adding one would improve readability for readers unfamiliar with the decomposition.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We appreciate the referee's identification of areas where additional evidence would strengthen the central claims regarding the share-private decomposition. We address each major comment below and will revise the manuscript accordingly to provide the requested isolations and diagnostics.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the manuscript reports accuracy improvements over representative multimodal baselines but provides no ablation that removes the share-private decomposition (or the independence regularizer) while retaining the same SAM/SPTNet encoders and attention aggregator. Without this isolation, gains cannot be attributed to the proposed mechanism rather than backbone choice or fusion architecture, which is load-bearing for the central claim.

    Authors: We agree that an explicit ablation isolating the contribution of the share-private decomposition is necessary to attribute performance gains specifically to this mechanism. In the revised manuscript, we will add experiments that retain the identical SAM and SPTNet encoders along with the attention-based aggregator but replace the decomposition step with direct fusion of the full 2D and 3D features. This will allow direct comparison to the full UniD-Shift pipeline and quantify the incremental benefit of the decomposition and regularizer. revision: yes

  2. Referee: [§3.2] §3.2 (decomposition and objective): no quantitative diagnostics are reported to verify that the learned subspaces actually separate as intended (e.g., subspace correlation, mutual information between shared and private components, or reconstruction fidelity after decomposition). The claim of “interpretable” and “independent” subspaces therefore rests on the regularizer alone without empirical confirmation.

    Authors: We acknowledge that empirical verification of subspace separation would provide stronger support for the interpretability and independence claims. In the revision, we will report quantitative diagnostics including (i) average correlation coefficients between shared and private subspaces, (ii) estimated mutual information between the components, and (iii) reconstruction fidelity metrics when reconstructing original features from the decomposed subspaces. These will be added to §3.2 and the experimental section. revision: yes

  3. Referee: [Cross-domain evaluation] Cross-domain evaluation (nuScenes USA-Singapore): the stability result is presented as evidence of strong generalization, yet the paper does not analyze whether the shared subspace remains consistent across domains or whether private components absorb the shift; this leaves the mechanism’s contribution to robustness unverified.

    Authors: We recognize that the current cross-domain results demonstrate stability but do not directly verify the role of the shared subspace in achieving robustness. In the revised manuscript, we will add analysis of the cross-domain behavior, including quantitative similarity measures (e.g., cosine similarity or correlation) of the shared features across the USA and Singapore splits, as well as qualitative comparisons showing how private components capture domain-specific variations while the shared subspace remains consistent. revision: yes
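
As a concrete reading of responses 2 and 3 above, the sketch below shows the kind of diagnostics they describe: correlation between shared and private components, reconstruction fidelity from the decomposed parts, and cross-domain similarity of the shared features. The function names, the least-squares reconstruction probe, and the mean-feature cosine comparison are editorial assumptions, not the authors' stated protocol.

```python
# Hedged sketch of subspace-separation and cross-domain diagnostics of the kind
# proposed in responses 2 and 3. All details are illustrative assumptions.
import torch
import torch.nn.functional as F


def mean_abs_correlation(shared: torch.Tensor, private: torch.Tensor) -> float:
    """Mean |Pearson correlation| between shared and private dimensions.
    Inputs are (num_samples, dim); values near 0 suggest well-separated subspaces."""
    s = (shared - shared.mean(0)) / (shared.std(0) + 1e-8)
    p = (private - private.mean(0)) / (private.std(0) + 1e-8)
    corr = (s.T @ p) / s.shape[0]          # (dim_shared, dim_private)
    return corr.abs().mean().item()


def reconstruction_error(original: torch.Tensor, shared: torch.Tensor,
                         private: torch.Tensor) -> float:
    """Relative error of a least-squares linear reconstruction of the original
    features from [shared, private]; low values mean little information is lost."""
    z = torch.cat([shared, private], dim=-1)
    w = torch.linalg.lstsq(z, original).solution
    recon = z @ w
    return (torch.norm(original - recon) / torch.norm(original)).item()


def cross_domain_shared_similarity(shared_src: torch.Tensor,
                                   shared_tgt: torch.Tensor) -> float:
    """Cosine similarity between mean shared features from two domain splits
    (e.g. nuScenes USA vs. Singapore); values near 1 would indicate a
    domain-stable shared subspace."""
    return F.cosine_similarity(shared_src.mean(0, keepdim=True),
                               shared_tgt.mean(0, keepdim=True), dim=-1).item()
```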

Circularity Check

0 steps flagged

No circularity: novel decomposition framework with independent empirical validation on benchmarks

Full rationale

The paper introduces a multimodal decomposition into shared and private subspaces using SAM and SPTNet encoders, followed by attention fusion and a regularized objective. No equations or claims reduce by construction to fitted parameters, self-citations, or prior ansatzes from the same authors. The derivation chain consists of standard feature extraction, explicit subspace separation motivated by domain insight, and fusion, all validated externally on SemanticKITTI, nuScenes, and cross-domain splits without tautological reductions. Self-citations are absent from load-bearing steps, and the method does not rename known results or import uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests primarily on a domain assumption about shared semantics between modalities and introduces share-private subspaces as a core mechanism; no free parameters or additional axioms are specified in the abstract.

axioms (1)
  • domain assumption: Features learned from 2D and 3D segmentation share some common semantics, while other aspects remain modality-specific.
    Explicitly stated as the motivating insight in the abstract.
invented entities (1)
  • Shared and private subspaces (no independent evidence)
    purpose: To summarize common semantic factors across modalities and preserve modality-unique properties for better fusion.
    Core component of the proposed decomposition framework.

pith-pipeline@v0.9.0 · 5574 in / 1395 out tokens · 41945 ms · 2026-05-11T01:43:50.448348+00:00 · methodology

discussion (0)

