
arxiv: 2605.24243 · v1 · pith:UDKB6CQZnew · submitted 2026-05-22 · 💻 cs.CV · cs.AI · stat.ML

GIBLy: Improving 3D Semantic Segmentation through an Architecture-Agnostic Lightweight Geometric Inductive Bias Layer

Pith reviewed 2026-06-30 15:33 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · stat.ML
keywords 3D semantic segmentation · geometric inductive bias · lightweight layer · point cloud processing · architecture-agnostic · learnable primitives · scene understanding

The pith

GIBLy adds a lightweight layer to any 3D segmentation architecture that supplies features aligned with simple geometric shapes to raise accuracy at low cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GIBLy as an add-on layer that injects learnable geometric priors into existing 3D semantic segmentation models. Current deep networks capture basic shapes only indirectly through scale and data volume, which raises training costs and can limit generalization. GIBLy supplies features aligned with simple geometric shapes in a form that plugs into MLP, convolution, or transformer backbones without architecture-specific changes. Experiments report consistent gains on multiple benchmarks, including up to +11.5% mIoU on TS40K when paired with PTV3, while adding only 58K parameters. The core argument is that explicit encoding of geometric structure supports more accurate and efficient 3D scene understanding.

Core claim

GIBLy is a lightweight geometric inductive bias layer that integrates learnable geometric priors into 3D segmentation pipelines. It enhances existing architectures by providing features aligned with simple geometric shapes that improve segmentation performance with minimal computational overhead. Validation across multiple benchmarks shows consistent performance gains, including gains of up to +11.5% mIoU on TS40K with PTV3, while adding only 58K extra parameters.
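Since the headline number is an mIoU gain, it may help to recall how mean intersection-over-union is usually computed for per-point labels. The sketch below is a generic illustration, not the paper's evaluation code: the function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions, and benchmarks differ on that convention. The summary also does not say whether +11.5% is an absolute or a relative change.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over classes that appear in the prediction or the ground truth.

    Assumption: classes absent from both are skipped; some benchmarks
    handle that case differently.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        union = tp + fp + fn
        if union == 0:
            continue  # class occurs in neither pred nor target
        ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0
```

With four points, `mean_iou([0, 0, 1, 1], [0, 1, 1, 1], 2)` averages an IoU of 1/2 for class 0 and 2/3 for class 1.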

What carries the argument

The GIBLy layer, which supplies features aligned with simple geometric shapes to the model backbone.
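To make "features aligned with simple geometric shapes" concrete, here is a minimal sketch of one possible cylinder alignment score, loosely following the Figure 1 caption (a cylinder with a learned orientation and radius scored against a query point's neighborhood). Everything specific here is an assumption for illustration: the two-angle axis parameterization, the Gaussian falloff with width `sigma`, the mean aggregation, and the names `axis_from_angles` and `cylinder_alignment`. The paper's actual formulation may differ.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def axis_from_angles(theta: float, phi: float) -> Point:
    """Unit axis direction from spherical angles (hypothetical parameterization)."""
    return (
        math.sin(theta) * math.cos(phi),
        math.sin(theta) * math.sin(phi),
        math.cos(theta),
    )


def cylinder_alignment(
    query: Point,
    neighbors: Sequence[Point],
    axis: Point,
    radius: float,
    sigma: float = 0.05,
) -> float:
    """Mean Gaussian score of how close neighbors lie to a cylinder surface.

    The cylinder axis passes through `query` along the unit vector `axis`.
    A neighbor at perpendicular distance `radius` from the axis scores 1.
    """
    if not neighbors:
        return 0.0
    total = 0.0
    for p in neighbors:
        v = tuple(pi - qi for pi, qi in zip(p, query))
        along = sum(vi * ai for vi, ai in zip(v, axis))  # projection on axis
        perp_sq = sum(vi * vi for vi in v) - along * along
        dist = math.sqrt(max(perp_sq, 0.0))  # distance from the axis
        total += math.exp(-((dist - radius) ** 2) / (2.0 * sigma ** 2))
    return total / len(neighbors)
```

In a learnable layer the angles and radius would be trainable parameters updated by gradient descent; plain floats are used here only to keep the sketch dependency-free.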

If this is right

  • The same layer produces gains when attached to MLP-based, convolution-based, and transformer-based backbones.
  • Consistent accuracy lifts appear across several standard 3D semantic segmentation datasets.
  • The added cost stays low at roughly 58K parameters regardless of the host architecture.
  • The supplied features remain aligned with human-interpretable geometric shapes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Smaller overall models could reach performance levels that currently require much larger networks in 3D tasks.
  • The same plug-in approach may transfer to other 3D problems such as object detection or instance segmentation.
  • The human-interpretable geometric features could support post-hoc inspection of model decisions on point clouds.

Load-bearing premise

The observed performance gains are produced by the geometric inductive bias rather than by the simple addition of new parameters or altered training dynamics.

What would settle it

Replace the geometric parameters inside GIBLy with random values of the same count, retrain the same backbones, and check whether the reported mIoU gains on TS40K and other benchmarks disappear.
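The comparison described here can be summarized with a few lines of bookkeeping once the three configurations (baseline, backbone plus GIBLy, backbone plus a parameter-matched control) have been trained. The sketch below is only that bookkeeping, under the assumption that each configuration yields one mIoU value per random seed; the function name `summarize_control` and the dictionary keys are hypothetical.

```python
from statistics import mean
from typing import Dict, Sequence


def summarize_control(
    baseline: Sequence[float],
    with_layer: Sequence[float],
    with_control: Sequence[float],
) -> Dict[str, float]:
    """Compare mean mIoU across seeds for three training configurations."""
    b, g, c = mean(baseline), mean(with_layer), mean(with_control)
    return {
        "gain_over_baseline": g - b,          # what the paper reports
        "control_gain_over_baseline": c - b,  # effect of extra capacity alone
        "gain_beyond_control": g - c,         # part attributable to geometry
    }


# Synthetic placeholder values, NOT results from the paper.
summary = summarize_control(
    baseline=[0.50, 0.52],
    with_layer=[0.60, 0.62],
    with_control=[0.51, 0.53],
)
```

If `gain_beyond_control` is close to zero, the load-bearing premise fails; if it stays close to `gain_over_baseline`, the geometric prior is doing the work.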

Figures

Figures reproduced from arXiv: 2605.24243 by Alessandra Micheletti, Cláudia Soares, Diogo Lavado.

Figure 1: GIBLy injects learnable geometric priors to improve 3D understanding. Left: A Cylinder geometric inductive bias (GIB) aligns to a chair leg (neighborhood N_q), using learned orientation ϕ and radius r, producing an alignment score. Right: On TS40K [21], adding a GIB-Layer (GIBLy) single-handedly boosts mIoU across multiple backbones (up to +11.5% mIoU on PTV3 [49]) with only 58K extra parameters.
Figure 2: Overview of our method. (a) Our GIBLy module injects geometric awareness into point features by applying a set of learnable geometric inductive biases (GIBs) at each query point. We adopt a multi-region design: all input points are treated as query points, and multiple neighborhood scales are considered per point. For a given query i, g_{i,N} denotes the GIB alignment scores computed at N neighborhoods. These…
Figure 3: A normalized bias ensures that neighbors aligning with…
Figure 4: Qualitative results on the TS40K dataset. Each row presents a different scene. From left to right: input point cloud, prediction from the baseline PointTransformerV3 (PTV3), and prediction from PTV3 augmented with GIBLy. In the three rows, the baseline fails to detect support towers, introduces spurious noise in vegetation regions, or misclassifies artifacts as vegetation. In contrast, the GIBLy-augmente…
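The Figure 2 caption describes a multi-region design in which every input point is a query point and several neighborhood scales are evaluated per point. A rough sketch of that loop follows, assuming radius-based (ball) neighborhoods and a pluggable scoring function. The names `ball_query` and `multi_scale_scores`, the brute-force search, and the inclusion of the query point in its own neighborhood are all illustrative assumptions, not details confirmed by the paper.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]
ScoreFn = Callable[[Point, List[Point]], float]


def ball_query(points: Sequence[Point], center: Point, radius: float) -> List[Point]:
    """All points within `radius` of `center` (brute force, includes center)."""
    return [p for p in points if math.dist(p, center) <= radius]


def multi_scale_scores(
    points: Sequence[Point],
    radii: Sequence[float],
    score_fn: ScoreFn,
) -> List[List[float]]:
    """For each query point, one score per neighborhood radius."""
    return [
        [score_fn(q, ball_query(points, q, r)) for r in radii]
        for q in points
    ]
```

Each inner list plays the role of the per-scale alignment scores for one query point. The brute-force search is quadratic in the number of points, so a real pipeline would use a spatial index or GPU neighbor search instead.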
Original abstract

In 3D scene understanding, deep learning models rely on large models and extensive training to capture basic geometric structures that are present in the 3D data. However, existing methods lack explicit mechanisms to incorporate geometric information, such as learnable primitive shapes, often necessitating large models and more training data which in turn increases cost and can limit generalization. We introduce GIBLy, a lightweight geometric inductive bias layer that integrates learnable geometric priors into 3D segmentation pipelines. GIBLy enhances existing architectures -- whether MLP-based, convolution-based, or transformer-based -- by providing features aligned with simple geometric shapes (and thus human-interpretable) that improve segmentation performance with minimal computational overhead. We validate our approach across multiple 3D semantic segmentation benchmarks, demonstrating consistent performance gains, including up to +11.5% mIoU on TS40K with PTV3, while adding only 58K extra parameters. Our results highlight the benefit of explicitly encoding geometric structure to support accurate and efficient 3D scene understanding, with a lightweight add-on layer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GIBLy, a lightweight add-on layer that injects learnable geometric priors (aligned with simple shapes) into existing 3D semantic segmentation architectures (MLP-, CNN-, or transformer-based). It claims consistent mIoU gains across benchmarks, including +11.5% on TS40K with PTV3, while adding only 58K parameters and negligible compute, by providing human-interpretable geometric features that reduce reliance on large models or extra data.

Significance. If the reported gains are shown to stem specifically from the geometric inductive bias (rather than added capacity), the method would offer a practical, architecture-agnostic way to encode basic 3D structure in segmentation pipelines. This could support more efficient training and better generalization on geometric scenes, with the small parameter count making it easy to adopt as a plug-in module.

major comments (2)
  1. [Experiments] Experiments section (and abstract): the central attribution of mIoU gains (e.g., +11.5% on TS40K with PTV3) to the geometric inductive bias is not isolated from the effect of inserting any 58K-parameter module. No ablation compares GIBLy against a parameter-matched non-geometric control (random features, plain MLP, or frozen weights) at the identical insertion point, nor reports architecture-specific retuning ablations. This leaves open whether the improvement arises from the learnable priors or from optimization dynamics of the added capacity.
  2. [Method and Results] Method and results sections: the claim that GIBLy is 'architecture-agnostic' and integrates 'cleanly' into arbitrary backbones lacks evidence on whether insertion requires per-architecture hyperparameter retuning or produces hidden conflicts in feature scales or gradient flow. Without such checks the generality assertion remains untested.
minor comments (2)
  1. [Abstract] Abstract and §1: performance numbers are stated without reference to exact baselines, data splits, number of runs, or statistical significance; adding these details would strengthen the claims even if the ablations above are the primary concern.
  2. [Method] Notation: the description of 'learnable geometric priors' and how they are aligned with 'simple geometric shapes' would benefit from an explicit equation or diagram showing the prior parameterization and the feature-alignment operation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our contributions. We address each major point below and will revise the manuscript accordingly to strengthen the evidence for our claims.

Point-by-point responses
  1. Referee: [Experiments] the central attribution of mIoU gains to the geometric inductive bias is not isolated from the effect of inserting any 58K-parameter module. No ablation compares GIBLy against a parameter-matched non-geometric control.

    Authors: We agree that the current experiments do not include a parameter-matched non-geometric control (e.g., random features or plain MLP) at the identical insertion point. This is a valid concern for isolating the geometric prior's contribution. In the revised manuscript we will add such ablations across the reported backbones and datasets, along with frozen-weight controls, to directly address whether gains arise from the geometric alignment rather than added capacity. revision: yes

  2. Referee: [Method and Results] the claim that GIBLy is 'architecture-agnostic' and integrates 'cleanly' lacks evidence on whether insertion requires per-architecture hyperparameter retuning or produces hidden conflicts in feature scales or gradient flow.

    Authors: Our experiments already demonstrate integration into MLP-, CNN-, and transformer-based models with consistent gains using the same default hyperparameters and insertion strategy. However, we did not explicitly analyze retuning requirements or potential scale/gradient issues. We will add a dedicated subsection with gradient-norm statistics, feature-scale comparisons before/after insertion, and a note on whether any architecture-specific adjustments were needed, to better substantiate the agnostic claim. revision: yes
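The feature-scale comparison promised in the second response could start from something as simple as per-point feature-norm statistics taken before and after the layer is inserted. The sketch below assumes features are available as plain lists of floats (one row per point); `norm_stats` and `scale_ratio` are hypothetical helper names, and a ratio far from 1 is only a hint that rescaling or normalization might be needed, not a diagnosis.

```python
import math
from statistics import mean, pstdev
from typing import Dict, Sequence


def norm_stats(features: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Mean and population std of per-point L2 feature norms."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in features]
    return {"mean": mean(norms), "std": pstdev(norms)}


def scale_ratio(
    before: Sequence[Sequence[float]],
    after: Sequence[Sequence[float]],
) -> float:
    """Ratio of mean feature norm after insertion to before insertion."""
    base = norm_stats(before)["mean"]
    if base == 0.0:
        raise ValueError("mean feature norm before insertion is zero")
    return norm_stats(after)["mean"] / base
```

The same pattern applies to gradient norms: collect one norm per parameter group per step and compare the distributions with and without the added layer.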

Circularity Check

0 steps flagged

No circularity: additive layer with empirical validation, no derivations or self-referential reductions

full rationale

The paper presents GIBLy as an architecture-agnostic additive layer that injects learnable geometric priors into existing 3D segmentation backbones (MLP, conv, transformer). No equations, uniqueness theorems, or derivation chains appear in the provided text that would reduce the reported mIoU gains or feature alignments to quantities defined by the method's own fitted parameters. Performance claims rest on benchmark experiments rather than any self-definitional or fitted-input-called-prediction structure. Self-citations, if present, are not load-bearing for the central claim, which remains an independent empirical proposal rather than a closed loop.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on one domain assumption about the utility of primitive geometric shapes and introduces learnable parameters inside the new layer; no invented physical entities are postulated.

free parameters (1)
  • learnable geometric priors
    Parameters inside GIBLy are trained to align with simple shapes and are therefore fitted during optimization.
axioms (1)
  • domain assumption: Features aligned with simple geometric shapes improve 3D semantic segmentation when added to existing models.
    The abstract states that providing such features yields consistent gains across architectures.



Reference graph

Works this paper leans on

57 extracted references · 4 canonical work pages · 1 internal anchor

  1. [1] Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1534–1543, 2016.

  2. [2] Jonathan Baxter. A model of inductive bias learning. Journal of artificial intelligence research, 12:149–198, 2000.

  3. [3] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9297–9307,

  4. [4] Arturs Berzins, Andreas Radler, Eric Volkmann, Sebastian Sanokowski, Sepp Hochreiter, and Johannes Brandstetter. Geometry-informed neural networks. arXiv preprint arXiv:2402.14009, 2024.

  5. [5] Giovanni Bocchi, Patrizio Frosini, Alessandra Micheletti, Alessandro Pedretti, Carmen Gratteri, Filippo Lunghini, Andrea Rosario Beccari, and Carmine Talarico. Geneonet: A new machine learning paradigm based on group equivariant non-expansive operators. an application to protein pocket detection. arXiv preprint arXiv:2202.00451, 2022.

  6. [6] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.

  7. [7] Scott Cameron, Arnu Pretorius, and Stephen Roberts. Nonparametric boundary geometry in physics informed deep learning. Advances in Neural Information Processing Systems, 36, 2024.

  8. [8] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3075–3084,

  9. [9] Nadav Cohen and Amnon Shashua. Inductive bias of deep convolutional networks through pooling geometry. arXiv preprint arXiv:1605.06743, 2016.

  10. [10] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017.

  11. [11] Anirudh Goyal and Yoshua Bengio. Inductive biases for deep learning of higher-level cognition. Proceedings of the Royal Society A, 478(2266):20210068, 2022.

  12. [12] Fabian Groh, Patrick Wieschollek, and Hendrik PA Lensch. Flex-convolution: Million-scale point-cloud learning beyond grid-worlds. In Asian Conference on Computer Vision, pages 105–122. Springer, 2018.

  13. [13] Rana Hanocka, Amir Hertz, Noa Fish, Raja Giryes, Shachar Fleishman, and Daniel Cohen-Or. Meshcnn: a network with an edge. ACM Transactions on Graphics (ToG), 38(4):1–12,

  14. [14] Pedro Hermosilla, Tobias Ritschel, Pere-Pau Vázquez, Àlvar Vinacua, and Timo Ropinski. Monte carlo convolution for learning on non-uniformly sampled point clouds. ACM Transactions On Graphics (TOG), 37(6):1–12, 2018.

  15. [15] Lingdong Kong, Youquan Liu, Runnan Chen, Yuexin Ma, Xinge Zhu, Yikang Li, Yuenan Hou, Yu Qiao, and Ziwei Liu. Rethinking range view representation for lidar segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 228–240, 2023.

  16. [16] Xin Lai, Jianhui Liu, Li Jiang, Liwei Wang, Hengshuang Zhao, Shu Liu, Xiaojuan Qi, and Jiaya Jia. Stratified transformer for 3d point cloud segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8500–8509, 2022.

  17. [17] Xin Lai, Yukang Chen, Fanbin Lu, Jianhui Liu, and Jiaya Jia. Spherical transformer for lidar-based 3d recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17545–17555, 2023.

  18. [18] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12697–12705, 2019.

  19. [19] Diogo Lavado, Cláudia Soares, Alessandra Micheletti, Giovanni Bocchi, Alex Coronati, Manuel Silva, and Patrizio Frosini. Low-resource white-box semantic segmentation of supporting towers on 3d point clouds via signature shape identification. arXiv preprint arXiv:2306.07809, 2023.

  20. [20] Diogo Lavado, Cláudia Soares, Alessandra Micheletti, et al. Scene-net v2: Interpretable multiclass 3d scene understanding with geometric priors. Proceedings of Machine Learning Research, 251:222–232, 2024.

  21. [21] Diogo Lavado, Ricardo Santos, André Coelho, João Santos, Alessandra Micheletti, and Claudia Soares. Learning under noisy labels, spurious points, and diverse structures: Ts40k, a 3d point cloud dataset of rural terrain and electrical transmission systems. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 7326–7336. IE...

  22. [22] Itay Lavie, Guy Gur-Ari, and Zohar Ringel. Towards understanding inductive bias in transformers: A view from infinity. In Proceedings of the 41st International Conference on Machine Learning, pages 26043–26069. PMLR, 2024.

  23. [23] Felix Järemo Lawin, Martin Danelljan, Patrik Tosteberg, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. Deep projective 3d semantic segmentation. In Computer Analysis of Images and Patterns: 17th International Conference, CAIP 2017, Ystad, Sweden, August 22-24, 2017, Proceedings, Part I 17, pages 95–107. Springer, 2017.

  24. [24] Truc Le and Ye Duan. Pointgrid: A deep network for 3d shape understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9204–9214, 2018.

  25. [25] Huan Lei, Naveed Akhtar, and Ajmal Mian. Octree guided cnn with spherical kernels for 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9631–9640, 2019.

  26. [26] Huan Lei, Naveed Akhtar, and Ajmal Mian. Spherical kernel for efficient graph convolution on 3d point clouds. IEEE transactions on pattern analysis and machine intelligence, 43(10):3664–3680, 2020.

  27. [27] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed points. Advances in neural information processing systems, 31, 2018.

  28. [28] Zongyi Li, Nikola Kovachki, Chris Choy, Boyi Li, Jean Kossaifi, Shourya Otta, Mohammad Amin Nabian, Maximilian Stadler, Christian Hundt, Kamyar Azizzadenesheli, et al. Geometry-informed neural operator for large-scale 3d pdes. Advances in Neural Information Processing Systems, 36, 2024.

  29. [29] Yecheng Lyu, Xinming Huang, and Ziming Zhang. Learning to segment 3d point clouds in 2d image space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12255–12264, 2020.

  30. [30] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 922–928. IEEE, 2015.

  31. [31] Hsien-Yu Meng, Lin Gao, Yu-Kun Lai, and Dinesh Manocha. Vv-net: Voxel vae net with group convolutions for point cloud segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8500–8508, 2019.

  32. [32] Jan Oldenburg, Finja Borowski, Alper Öner, Klaus-Peter Schmitz, and Michael Stiehm. Geometry aware physics informed neural network surrogate for solving navier–stokes equation (gapinn). Advanced Modeling and Simulation in Engineering Sciences, 9(1):8, 2022.

  33. [33] Chunghyun Park, Yoonwoo Jeong, Minsu Cho, and Jaesik Park. Fast point transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16949–16958, 2022.

  34. [34] Bohao Peng, Xiaoyang Wu, Li Jiang, Yukang Chen, Hengshuang Zhao, Zhuotao Tian, and Jiaya Jia. Oa-cnns: Omni-adaptive sparse cnns for 3d semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21305–21315, 2024.

  35. [35] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660,

  36. [36] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30, 2017.

  37. [37] Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3693–3702, 2017.

  38. [38] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pages 945–953,

  39. [39] Weiwei Sun, Andrea Tagliasacchi, Boyang Deng, Sara Sabour, Soroosh Yazdani, Geoffrey E Hinton, and Kwang Moo Yi. Canonical capsules: Self-supervised capsules in canonical pose. Advances in Neural information processing systems, 34:24993–25005, 2021.

  40. [40] Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J Guibas. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6411–6420, 2019.

  41. [41] A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.

  42. [42] Lei Wang, Yuchun Huang, Yaolin Hou, Shenman Zhang, and Jie Shan. Graph attention convolution for point cloud semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10296–10305, 2019.

  43. [43] Peng-Shuai Wang. Octformer: Octree-based transformers for 3d point clouds. ACM Transactions on Graphics (TOG), 42(4):1–11, 2023.

  44. [44] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions On Graphics (TOG), 36(4):1–11, 2017.

  45. [45] Shenlong Wang, Simon Suo, Wei-Chiu Ma, Andrei Pokrovsky, and Raquel Urtasun. Deep parametric continuous convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2589–2597, 2018.

  46. [46] Zihao Wang and Lei Wu. Theoretical analysis of the inductive biases in deep convolutional networks. Advances in Neural Information Processing Systems, 36:74289–74338, 2023.

  47. [47] Wenxuan Wu, Zhongang Qi, and Li Fuxin. Pointconv: Deep convolutional networks on 3d point clouds. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 9621–9630, 2019.

  48. [48] Xiaoyang Wu, Yixing Lao, Li Jiang, Xihui Liu, and Hengshuang Zhao. Point transformer v2: Grouped vector attention and partition-based pooling. In NeurIPS, 2022.

  49. [49] Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point transformer v3: Simpler faster stronger. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4840–4851, 2024.

  50. [50] Hengyuan Xu, Liyao Xiang, Hangyu Ye, Dixi Yao, Pengzhi Chu, and Baochun Li. Permutation equivariance of transformers and its applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5987–5996, 2024.

  51. [51] Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. Spidercnn: Deep learning on point sets with parameterized convolutional filters. In Proceedings of the European conference on computer vision (ECCV), pages 87–102, 2018.

  52. [52] Yinshuang Xu, Jiahui Lei, and Kostas Daniilidis. $SE(3)$ equivariant convolution and transformer in ray space. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

  53. [53] Ze Yang and Liwei Wang. Learning relationships for multi-view 3d object recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7505–7514, 2019.

  54. [54] Wang Yifan, Carl Doersch, Relja Arandjelović, Joao Carreira, and Andrew Zisserman. Input-level inductive biases for 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6176–6186, 2022.

  55. [55] Yang Zhang, Zixiang Zhou, Philip David, Xiangyu Yue, Zerong Xi, Boqing Gong, and Hassan Foroosh. Polarnet: An improved grid representation for online lidar point clouds semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9601–9610, 2020.

  56. [56] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 16259–16268, 2021.

  57. [57] Xinge Zhu, Hui Zhou, Tai Wang, Fangzhou Hong, Wei Li, Yuexin Ma, Hongsheng Li, Ruigang Yang, and Dahua Lin. Cylindrical and asymmetrical 3d convolution networks for lidar-based perception. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):6807–6822, 2021.