pith. machine review for the scientific record.

arxiv: 2605.07356 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: no theorem link

UniD-Shift: Towards Unified Semantic Segmentation via Interpretable Share-Private Multimodal Decomposition

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords: semantic segmentation · multimodal fusion · 2D-3D segmentation · point cloud · feature decomposition · share-private subspaces · distribution shift · autonomous driving

The pith

Decomposing 2D image and 3D point cloud features into shared semantic and private modality-specific subspaces unifies cross-modal fusion and improves accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to unify semantic segmentation across 2D camera images and 3D LiDAR point clouds by breaking extracted features into shared components that capture semantics common to both and private components that keep each sensor's unique traits. This addresses the practical difficulty of aligning sparse 3D scans with distorted 2D views in settings like autonomous driving and urban modeling. The framework pairs a SAM-based 2D encoder with an SPTNet-based 3D encoder, performs the explicit decomposition, fuses only the shared part via lightweight attention, and trains with regularization to enforce alignment plus subspace independence. If successful, the result is higher segmentation accuracy on large benchmarks together with stable behavior when test data comes from different geographic regions.
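
As a rough illustration of the mechanism described above, the following PyTorch sketch shows one way a share-private decomposition with shared-only attention fusion could be wired together. The module names, feature dimensions, and the single cross-attention layer are editorial assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of the share-private decomposition and shared-feature fusion
# pattern described above. Layer choices and dimensions are illustrative
# assumptions, not the architecture reported in the paper.
import torch
import torch.nn as nn


class SharePrivateDecomposer(nn.Module):
    """Projects one modality's features into a shared and a private subspace."""

    def __init__(self, in_dim: int, sub_dim: int):
        super().__init__()
        self.to_shared = nn.Linear(in_dim, sub_dim)   # common semantic factors
        self.to_private = nn.Linear(in_dim, sub_dim)  # modality-specific factors

    def forward(self, feats: torch.Tensor):
        return self.to_shared(feats), self.to_private(feats)


class SharedAttentionFusion(nn.Module):
    """Aggregates the two shared components with lightweight cross-attention."""

    def __init__(self, sub_dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(sub_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(sub_dim)

    def forward(self, shared_3d: torch.Tensor, shared_2d: torch.Tensor):
        # 3D points query the 2D shared features; the residual keeps the 3D stream intact.
        fused, _ = self.attn(query=shared_3d, key=shared_2d, value=shared_2d)
        return self.norm(shared_3d + fused)


if __name__ == "__main__":
    B, N_pts, N_pix, d2, d3, d = 2, 4096, 1024, 256, 128, 96
    img_feats = torch.randn(B, N_pix, d2)   # stand-in for SAM-based 2D features
    pts_feats = torch.randn(B, N_pts, d3)   # stand-in for SPTNet-based 3D features

    dec_2d, dec_3d = SharePrivateDecomposer(d2, d), SharePrivateDecomposer(d3, d)
    s2d, p2d = dec_2d(img_feats)
    s3d, p3d = dec_3d(pts_feats)

    fused = SharedAttentionFusion(d)(s3d, s2d)    # (B, N_pts, d)
    per_point = torch.cat([fused, p3d], dim=-1)   # shared + private feed the 3D head
    print(per_point.shape)
```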

Core claim

We present UniD-Shift, a unified multimodal framework for joint 2D-3D semantic segmentation. Features from a SAM-based vision encoder and an SPTNet-based geometric encoder are decomposed into shared subspaces that summarize common semantic factors and private subspaces that preserve modality-specific properties. A lightweight attention-based fusion module aggregates the shared features, while a regularized training objective enforces semantic alignment and subspace independence, producing improved segmentation on SemanticKITTI and nuScenes benchmarks along with strong cross-domain generalization on nuScenes USA-Singapore.
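
The claim leans on a regularized objective that enforces semantic alignment and subspace independence. The paper's exact losses are not reproduced here; the sketch below shows one common instantiation, assuming a cosine alignment term on paired shared features and a cross-covariance penalty between shared and private components, with illustrative weights.

```python
# Hedged sketch of a regularized objective of the kind described: a segmentation
# loss plus alignment and subspace-independence terms. The specific losses and
# weights below are editorial assumptions, not the paper's exact objective.
import torch
import torch.nn.functional as F


def alignment_loss(shared_3d, shared_2d):
    """Pulls paired shared features together (cosine alignment on matched point/pixel pairs)."""
    return (1.0 - F.cosine_similarity(shared_3d, shared_2d, dim=-1)).mean()


def independence_loss(shared, private):
    """Penalizes cross-covariance between shared and private subspaces (soft orthogonality)."""
    s = shared - shared.mean(dim=1, keepdim=True)
    p = private - private.mean(dim=1, keepdim=True)
    cross_cov = torch.bmm(s.transpose(1, 2), p) / s.shape[1]  # (B, d, d)
    return cross_cov.pow(2).mean()


def total_loss(seg_loss, s3d, s2d, p3d, p2d, lam_align=0.1, lam_indep=0.01):
    return (seg_loss
            + lam_align * alignment_loss(s3d, s2d)
            + lam_indep * (independence_loss(s3d, p3d) + independence_loss(s2d, p2d)))
```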

What carries the argument

The interpretable share-private multimodal decomposition, which separates common semantic factors from modality-unique properties to support cross-modal alignment and fusion.

If this is right

  • Segmentation accuracy rises consistently over representative multimodal baselines on SemanticKITTI and nuScenes.
  • Performance remains stable under distribution shifts in cross-domain evaluation on nuScenes USA-Singapore.
  • Computational efficiency stays competitive because the fusion module is kept lightweight.
  • The explicit decomposition yields interpretable separation between shared semantics and modality-specific details.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same share-private split could be tested on additional sensor pairs such as radar and camera to check whether common semantics transfer across more modalities.
  • Visual inspection of the learned shared subspace might identify which object categories align most reliably between 2D and 3D, informing sensor placement choices.
  • If private components turn out to be indispensable, the work implies that fully shared multimodal representations have inherent limits for segmentation tasks.
  • The regularization for subspace independence could be reused in other multimodal settings where alignment and uniqueness must both be preserved.

Load-bearing premise

That features learned from 2D images and 3D point clouds contain common semantic content that can be cleanly separated into independent shared and private subspaces without losing essential information.

What would settle it

An ablation experiment that removes the share-private decomposition, replaces it with direct feature concatenation, and shows no accuracy gain or a performance drop on the SemanticKITTI or nuScenes validation sets.
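
A minimal sketch of that control condition follows, assuming the same encoders and segmentation head are kept and only the fusion step changes; the projection width and layer choice are illustrative, not the authors' specification.

```python
# Sketch of the ablation baseline described above: keep the same encoders and
# segmentation head, but replace the share-private decomposition with direct
# concatenation of the full 2D and 3D features. Dimensions are assumptions.
import torch
import torch.nn as nn


class ConcatFusionBaseline(nn.Module):
    def __init__(self, dim_2d: int, dim_3d: int, out_dim: int):
        super().__init__()
        # Single projection over concatenated features; no shared/private split,
        # no alignment or independence regularizers.
        self.proj = nn.Sequential(nn.Linear(dim_2d + dim_3d, out_dim), nn.ReLU())

    def forward(self, feats_2d: torch.Tensor, feats_3d: torch.Tensor):
        # feats_2d is assumed already sampled at the projected point locations,
        # so both tensors are (B, N_points, C).
        return self.proj(torch.cat([feats_2d, feats_3d], dim=-1))
```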

Figures

Figures reproduced from arXiv: 2605.07356 by Jing Ou, Shuai Zhang, Tengxi Wang, Wufan Zhao, Yuan Liu, Zhecheng Shi, Zhuxiao Li.

Figure 1. Existing multimodal fusion strategies (left) often mix …
Figure 2. Overall architecture of the proposed UniD-Shift. The model takes synchronized 2D images and 3D point clouds as inputs. The 3D branch employs a SPTNet backbone to extract hierarchical geometric features, while the 2D branch utilizes a SAM-based encoder to obtain semantically enriched visual representations. Both modalities are decomposed into shared and private components, then feed into the Shared Attentio…
Figure 3. Architecture of the proposed shared-private feature de…
Figure 4. Visualization of object segmentation performance on the …
Figure 5. Visualization of object segmentation performance on the …
Figure 6. Representative failure cases on the nuScenes and Se…
Figure 7. More visualization of object segmentation performance …
Figure 9. More visualizations comparing object segmentation performance with other methods on the nuScenes validation set.
Figure 10. More visualizations comparing object segmentation performance with other methods on the SemanticKitti validation set.
Original abstract

Semantic segmentation of large-scale 3D point clouds is crucial for applications such as autonomous driving and urban digital twins. However, the sparse sampling pattern of LiDAR and the view-dependent geometric distortion in image observations complicate cross-modal alignment and hinder stable fusion. Inspired by the fact that 2D images captured by cameras are representations of the 3D world, we recognize that the features learned from 2D and 3D segmentation share some common semantics, while other aspects remain modality-specific. This insight motivates a unified multimodal framework for joint 2D-3D semantic segmentation. We combine a SAM-based vision encoder with a SPTNet-based geometric encoder to extract complementary semantic and geometric representations. The resulting features from both modalities are explicitly decomposed into shared and private subspaces, where the shared components summarize semantic factors common to both domains, and the private components preserve properties that are unique to each modality. A lightweight attention-based fusion module aggregates the shared features into a consistent cross-modal representation, and a regularized training objective ensures both semantic alignment and subspace independence. Experiments on the SemanticKITTI and nuScenes benchmarks demonstrate consistent improvements in segmentation accuracy over representative multimodal baselines, accompanied by competitive computational efficiency. Cross-domain evaluation on nuScenes USA-Singapore shows stable performance under distribution shifts, demonstrating strong generalization. The implementation code is publicly available at: https://github.com/shuaizhang69/UniD-Shift.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces UniD-Shift, a unified 2D-3D semantic segmentation framework that extracts features via a SAM-based vision encoder and SPTNet-based geometric encoder, explicitly decomposes them into shared (common semantics) and private (modality-specific) subspaces, applies a lightweight attention-based fusion module on the shared components, and optimizes with a regularized objective enforcing alignment and subspace independence. It reports consistent accuracy gains over multimodal baselines on SemanticKITTI and nuScenes, competitive efficiency, and stable cross-domain performance on nuScenes USA-Singapore splits, with public code released.

Significance. If the share-private decomposition can be shown to drive the reported gains and to produce genuinely independent subspaces, the work would provide a principled, interpretable mechanism for cross-modal fusion that addresses view-dependent distortions and sparse sampling in LiDAR-image pairs. The public implementation and cross-domain stability results would be concrete strengths for reproducibility and generalization claims in autonomous-driving segmentation.

major comments (3)
  1. [Experiments] Experiments section: the manuscript reports accuracy improvements over representative multimodal baselines but provides no ablation that removes the share-private decomposition (or the independence regularizer) while retaining the same SAM/SPTNet encoders and attention aggregator. Without this isolation, gains cannot be attributed to the proposed mechanism rather than backbone choice or fusion architecture, which is load-bearing for the central claim.
  2. [§3.2] §3.2 (decomposition and objective): no quantitative diagnostics are reported to verify that the learned subspaces actually separate as intended (e.g., subspace correlation, mutual information between shared and private components, or reconstruction fidelity after decomposition). The claim of “interpretable” and “independent” subspaces therefore rests on the regularizer alone without empirical confirmation.
  3. [Cross-domain evaluation] Cross-domain evaluation (nuScenes USA-Singapore): the stability result is presented as evidence of strong generalization, yet the paper does not analyze whether the shared subspace remains consistent across domains or whether private components absorb the shift; this leaves the mechanism’s contribution to robustness unverified.
minor comments (2)
  1. [§1] The abstract and §1 state that 2D and 3D features “share some common semantics” but do not cite prior empirical evidence or provide a motivating figure; a short related-work paragraph or illustrative example would strengthen the motivation.
  2. [§3] Notation for the shared/private projections and the attention aggregator is introduced without an explicit table of symbols; adding one would improve readability for readers unfamiliar with the decomposition.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We appreciate the referee's identification of areas where additional evidence would strengthen the central claims regarding the share-private decomposition. We address each major comment below and will revise the manuscript accordingly to provide the requested isolations and diagnostics.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the manuscript reports accuracy improvements over representative multimodal baselines but provides no ablation that removes the share-private decomposition (or the independence regularizer) while retaining the same SAM/SPTNet encoders and attention aggregator. Without this isolation, gains cannot be attributed to the proposed mechanism rather than backbone choice or fusion architecture, which is load-bearing for the central claim.

    Authors: We agree that an explicit ablation isolating the contribution of the share-private decomposition is necessary to attribute performance gains specifically to this mechanism. In the revised manuscript, we will add experiments that retain the identical SAM and SPTNet encoders along with the attention-based aggregator but replace the decomposition step with direct fusion of the full 2D and 3D features. This will allow direct comparison to the full UniD-Shift pipeline and quantify the incremental benefit of the decomposition and regularizer. revision: yes

  2. Referee: [§3.2] §3.2 (decomposition and objective): no quantitative diagnostics are reported to verify that the learned subspaces actually separate as intended (e.g., subspace correlation, mutual information between shared and private components, or reconstruction fidelity after decomposition). The claim of “interpretable” and “independent” subspaces therefore rests on the regularizer alone without empirical confirmation.

    Authors: We acknowledge that empirical verification of subspace separation would provide stronger support for the interpretability and independence claims. In the revision, we will report quantitative diagnostics including (i) average correlation coefficients between shared and private subspaces, (ii) estimated mutual information between the components, and (iii) reconstruction fidelity metrics when reconstructing original features from the decomposed subspaces. These will be added to §3.2 and the experimental section. revision: yes

  3. Referee: [Cross-domain evaluation] Cross-domain evaluation (nuScenes USA-Singapore): the stability result is presented as evidence of strong generalization, yet the paper does not analyze whether the shared subspace remains consistent across domains or whether private components absorb the shift; this leaves the mechanism’s contribution to robustness unverified.

    Authors: We recognize that the current cross-domain results demonstrate stability but do not directly verify the role of the shared subspace in achieving robustness. In the revised manuscript, we will add analysis of the cross-domain behavior, including quantitative similarity measures (e.g., cosine similarity or correlation) of the shared features across the USA and Singapore splits, as well as qualitative comparisons showing how private components capture domain-specific variations while the shared subspace remains consistent. revision: yes
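
As a concrete reading of responses 2 and 3 above, the sketch below shows the kind of diagnostics they describe: correlation between shared and private components, reconstruction fidelity from the decomposed parts, and cross-domain similarity of the shared features. The function names, the least-squares reconstruction probe, and the mean-feature cosine comparison are editorial assumptions, not the authors' stated protocol.

```python
# Hedged sketch of subspace-separation and cross-domain diagnostics of the kind
# proposed in responses 2 and 3. All details are illustrative assumptions.
import torch
import torch.nn.functional as F


def mean_abs_correlation(shared: torch.Tensor, private: torch.Tensor) -> float:
    """Mean |Pearson correlation| between shared and private dimensions.
    Inputs are (num_samples, dim); values near 0 suggest well-separated subspaces."""
    s = (shared - shared.mean(0)) / (shared.std(0) + 1e-8)
    p = (private - private.mean(0)) / (private.std(0) + 1e-8)
    corr = (s.T @ p) / s.shape[0]          # (dim_shared, dim_private)
    return corr.abs().mean().item()


def reconstruction_error(original: torch.Tensor, shared: torch.Tensor,
                         private: torch.Tensor) -> float:
    """Relative error of a least-squares linear reconstruction of the original
    features from [shared, private]; low values mean little information is lost."""
    z = torch.cat([shared, private], dim=-1)
    w = torch.linalg.lstsq(z, original).solution
    recon = z @ w
    return (torch.norm(original - recon) / torch.norm(original)).item()


def cross_domain_shared_similarity(shared_src: torch.Tensor,
                                   shared_tgt: torch.Tensor) -> float:
    """Cosine similarity between mean shared features from two domain splits
    (e.g. nuScenes USA vs. Singapore); values near 1 would indicate a
    domain-stable shared subspace."""
    return F.cosine_similarity(shared_src.mean(0, keepdim=True),
                               shared_tgt.mean(0, keepdim=True), dim=-1).item()
```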

Circularity Check

0 steps flagged

No circularity: novel decomposition framework with independent empirical validation on benchmarks

Full rationale

The paper introduces a multimodal decomposition into shared and private subspaces using SAM and SPTNet encoders, followed by attention fusion and a regularized objective. No equations or claims reduce by construction to fitted parameters, self-citations, or prior ansatzes from the same authors. The derivation chain consists of standard feature extraction, explicit subspace separation motivated by domain insight, and fusion, all validated externally on SemanticKITTI, nuScenes, and cross-domain splits without tautological reductions. Self-citations are absent from load-bearing steps, and the method does not rename known results or import uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests primarily on a domain assumption about shared semantics between modalities and introduces share-private subspaces as a core mechanism; no free parameters or additional axioms are specified in the abstract.

axioms (1)
  • domain assumption: Features learned from 2D and 3D segmentation share some common semantics, while other aspects remain modality-specific.
    Explicitly stated as the motivating insight in the abstract.
invented entities (1)
  • Shared and private subspaces (no independent evidence)
    purpose: To summarize common semantic factors across modalities and preserve modality-unique properties for better fusion.
    Core component of the proposed decomposition framework.

pith-pipeline@v0.9.0 · 5574 in / 1395 out tokens · 41945 ms · 2026-05-11T01:43:50.448348+00:00 · methodology

discussion (0)

