pith. machine review for the scientific record.

arxiv: 2605.10706 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: no theorem link

RelFlexformer: Efficient Attention 3D-Transformers for Integrable Relative Positional Encodings

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 05:21 UTC · model grok-4.3

classification 💻 cs.LG
keywords RelFlexformer · 3D Transformer · relative positional encoding · non-uniform Fourier transform · point clouds · efficient attention · 3D data modeling

The pith

RelFlexformers integrate arbitrary integrable relative positional encodings via non-uniform Fourier transforms to deliver O(L log L) attention for 3D points at irregular locations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RelFlexformers as a new family of 3D transformers that apply relative positional encodings defined by any integrable modulation function. These encodings are realized through the non-uniform fast Fourier transform, which keeps attention computation at O(L log L) cost even when input tokens sit at completely arbitrary positions in 3D space. The approach matters because many 3D datasets, especially point clouds, lack the regular grid structure that earlier efficient RPE methods required. By removing that structural constraint, the models generalize prior techniques while preserving or improving accuracy on downstream 3D tasks.
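The mechanism described above can be reconstructed in miniature: a quadrature of the modulation function's Fourier transform turns the RPE mask M_ij = f(x_i − x_j) into a short sum of rank-one phase terms, so masked attention never materializes the L×L mask, and the NU-FFT then accelerates the remaining phase sums. The numpy sketch below illustrates only this factorization (1D positions, a Gaussian modulation, direct phase sums in place of an actual NU-FFT; all sizes are hypothetical), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, S = 64, 8, 64  # tokens, head dim, quadrature size (all hypothetical)

# Arbitrary (non-grid) 1D token positions; the paper's setting is 3D.
x = rng.uniform(-1.0, 1.0, L)
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))

# Fourier quadrature of the integrable modulation f(r) = exp(-r^2 / 2):
# f(r) = (1/2pi) * int fhat(xi) e^{i xi r} dxi ~= sum_s c_s e^{i xi_s r},
# where fhat (the Fourier transform of f) is itself Gaussian.
xi = np.linspace(-20.0, 20.0, S)
c = (xi[1] - xi[0]) / (2 * np.pi) * np.sqrt(2 * np.pi) * np.exp(-xi**2 / 2)

# Direct O(L^2) reference: mask-modulated (unnormalized) attention.
M = np.exp(-((x[:, None] - x[None, :]) ** 2) / 2)
out_direct = (M * (Q @ K.T)) @ V

# Factorized form: M_ij ~= sum_s c_s e^{i xi_s x_i} e^{-i xi_s x_j}, so each
# quadrature term evaluates right-to-left without ever forming the L x L mask.
# (An NU-FFT would additionally batch the S phase sums, giving O(L log L).)
E = np.exp(1j * np.outer(x, xi))                  # L x S phase factors
out_fact = np.zeros((L, d), dtype=complex)
for s in range(S):
    KsT_V = (K * np.conj(E[:, s:s + 1])).T @ V    # d x d intermediate
    out_fact += c[s] * (Q * E[:, s:s + 1]) @ KsT_V
out_fact = out_fact.real

print(np.max(np.abs(out_direct - out_fact)))  # tiny quadrature error
```

The per-term cost here is O(L d²) across S quadrature points; the claimed O(L log L) comes from replacing the explicit phase matrix E with a non-uniform FFT over the quadrature grid.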

Core claim

RelFlexformers are 3D-Transformer models that flexibly integrate universal 3D Relative Positional Encoding methods given by arbitrary integrable modulation functions f, achieve O(L log L) attention complexity by leveraging the Non-Uniform Fourier Transform, and generalize existing RPE-attention techniques from homogeneous grid settings to arbitrary heterogeneous 3D position distributions such as point clouds.

What carries the argument

The Non-Uniform Fourier Transform (NU-FFT) applied to integrable modulation functions, which modulates attention scores for tokens at arbitrarily distributed 3D coordinates.

If this is right

  • Point-cloud modeling becomes directly feasible without forcing tokens onto uniform grids.
  • Attention for long 3D sequences remains practical because cost grows as O(L log L) rather than O(L²).
  • Existing RPE methods for grid data extend automatically to heterogeneous 3D layouts.
  • Empirical tests on multiple 3D datasets show measurable quality gains from the flexible encodings.
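To put the complexity bullet in perspective, a back-of-envelope operation count at a hypothetical sequence length (constants and the quadrature size omitted, so this is an asymptotic illustration only):

```python
import math

L = 100_000                                 # hypothetical point-cloud size
quadratic = L * L                           # O(L^2) pairwise score count
linearithmic = L * math.ceil(math.log2(L))  # O(L log L), constants ignored

print(quadratic // linearithmic)  # → 5882
```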

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same NU-FFT construction could be tested in 2D or 4D settings to check whether the efficiency pattern generalizes beyond three dimensions.
  • Hybrid models that combine RelFlexformer blocks with other fast attention primitives might further reduce constants in practice.
  • Applications such as 3D scene reconstruction or molecular modeling could benefit from the removal of grid assumptions.
  • Error bounds on the NU-FFT approximation for specific families of modulation functions remain open for tighter analysis.

Load-bearing premise

Integrable modulation functions exist that can be evaluated via NU-FFT on arbitrary 3D positions without extra structure or approximation errors large enough to erase the claimed speed or quality gains.
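This premise is checkable for concrete modulation functions. A minimal numpy sketch, assuming a 1D Gaussian modulation and a plain trapezoid Fourier quadrature standing in for the NU-FFT's gridding step, shows the approximation error collapsing as the quadrature size S grows, i.e. accuracy can be driven down without changing the asymptotics:

```python
import numpy as np

def quad_error(S, half_width=20.0, r_max=2.0):
    """Max pointwise error of an S-point trapezoid Fourier quadrature
    for the Gaussian modulation f(r) = exp(-r^2 / 2)."""
    xi = np.linspace(-half_width, half_width, S)
    # Quadrature weights times fhat(xi); fhat of a Gaussian is Gaussian.
    c = (xi[1] - xi[0]) / (2 * np.pi) * np.sqrt(2 * np.pi) * np.exp(-xi**2 / 2)
    r = np.linspace(-r_max, r_max, 201)
    approx = (c * np.exp(1j * np.outer(r, xi))).sum(axis=1).real
    return float(np.max(np.abs(approx - np.exp(-r**2 / 2))))

for S in (16, 32, 64):
    print(S, quad_error(S))  # error drops sharply as S grows
```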

What would settle it

Measure wall-clock attention time and downstream accuracy on a large irregular point-cloud dataset while scaling sequence length L; the claim holds if runtime stays linearithmic and accuracy does not drop below standard quadratic-attention baselines.

Figures

Figures reproduced from arXiv: 2605.10706 by Arijit Sehanobish, Avinava Dubey, Byeongchan Kim, Krzysztof Choromanski, Min-hwan Oh.

Figure 1. Execution time (ms) as a function of sequence length…
Figure 2. RelFlexformer Mask Behavior Analysis. Comparison of our proposed kernels against standard RBF and Laplace kernels. As the Euclidean distance between points increases, the kernel value decays smoothly, mimicking the behavior of Heat and Laplace kernels (dashed lines). The tight variance in the scatter highlights the stability of our projection method across varying spatial scales. Substituting this into the…
Figure 3. Effect of the quadrature size S per head across different datasets. The plots illustrate the performance trade-offs on NYU Depth v2 (left, mIoU%), ScanObjectNN v2 (middle, Overall Accuracy%), and SUN RGB-D (right, mIoU%) for λ = 2. The RPE mask M is blended from identity to full strength over the first N epochs via M̂ = 1 + α(M − 1), α = 1/2.
Original abstract

We present a new class of efficient attention mechanisms applying universal 3D Relative Positional Encoding (RPE) methods given by arbitrary integrable modulation functions $f$. They lead to the new class of 3D-Transformer models, called \textit{RelFlexformers}, flexibly integrating those RPEs, and characterized by the $O(L \log L)$ time complexity of the attention computation for the $L$-length input sequences. RelFlexformers builds on the theory of the Non-Uniform Fourier Transform (NU-FFT), naturally generalizing several existing efficient RPE-attention methods from structured settings with tokens homogeneously embedded in unweighted grids into general non-structured heterogeneous scenarios, where tokens' positions are arbitrarily distributed in the corresponding 3D spaces. As such, RelFlexformers can be applied in particular to model point clouds. Our extensive empirical evaluation on a large portfolio of 3D datasets confirms quality improvements provided by the NU-FFT-driven attention modulation techniques in the RelFlexformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces RelFlexformers, a new class of 3D-Transformer models that integrate arbitrary integrable relative positional encodings (RPEs) defined by modulation functions f via the Non-Uniform Fast Fourier Transform (NU-FFT). This yields O(L log L) attention complexity for L-length sequences with arbitrarily distributed 3D token positions, generalizing prior efficient RPE methods from homogeneous grid settings to heterogeneous unstructured point clouds, with empirical quality gains demonstrated on a portfolio of 3D datasets.

Significance. If the NU-FFT approximation errors can be rigorously bounded to preserve both the stated complexity and performance, the work would meaningfully advance efficient attention for unstructured 3D data by unifying and extending grid-based RPE techniques into a flexible framework applicable to point clouds. The manuscript earns credit for grounding the approach in established NU-FFT theory and for conducting extensive empirical evaluation across multiple 3D datasets.

major comments (2)
  1. [§3] §3 (NU-FFT-based RPE integration): The central claim of exact O(L log L) complexity for arbitrary 3D position distributions rests on representing integrable modulation functions f via NU-FFT, yet the manuscript provides no explicit error bounds or analysis for the approximation parameters (oversampling factor, kernel support, grid size) specific to these f; this is load-bearing for the generalization from structured grids to heterogeneous scenarios and for the claimed complexity-quality tradeoff.
  2. [§5] §5 (Empirical evaluation): The reported quality improvements lack ablation or sensitivity analysis on the NU-FFT approximation parameters, leaving open whether the gains hold under varying point densities or when error is controlled to machine precision, which directly affects the practical validity of the heterogeneous 3D claims.
minor comments (3)
  1. [§2] The definition and integrability conditions on the modulation function f are introduced late; an explicit early statement with an example would improve readability.
  2. [Figure 2] Figure 2 (attention visualization): The caption does not specify the exact NU-FFT parameters used, making it difficult to reproduce the depicted modulation effects.
  3. [§1] Several citations to prior grid-based RPE methods (e.g., in §1) could be expanded with direct complexity comparisons to highlight the precise generalization achieved.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The two major comments highlight important aspects of rigor in the theoretical claims and empirical validation. We address each point below and outline revisions that will strengthen the manuscript while preserving its core contributions on flexible 3D RPE via NU-FFT.

Point-by-point responses
  1. Referee: [§3] §3 (NU-FFT-based RPE integration): The central claim of exact O(L log L) complexity for arbitrary 3D position distributions rests on representing integrable modulation functions f via NU-FFT, yet the manuscript provides no explicit error bounds or analysis for the approximation parameters (oversampling factor, kernel support, grid size) specific to these f; this is load-bearing for the generalization from structured grids to heterogeneous scenarios and for the claimed complexity-quality tradeoff.

    Authors: We agree that a dedicated error analysis is necessary to fully support the generalization to heterogeneous 3D positions. The NU-FFT framework (as established in the literature, e.g., Greengard et al.) provides general error bounds controlled by the oversampling factor, kernel support width, and grid resolution, which can be made arbitrarily small for integrable f without changing the asymptotic O(L log L) complexity. However, the manuscript does not tailor these bounds explicitly to the modulation functions f used in RelFlexformers. In the revision, we will add a new subsection in §3 that (i) recalls the relevant NU-FFT error theorems, (ii) derives parameter-dependent bounds specific to the integrable f considered (including examples for common RPE kernels), and (iii) discusses how the approximation error trades off against the claimed complexity for unstructured point clouds. This will make the load-bearing assumptions explicit. revision: yes

  2. Referee: [§5] §5 (Empirical evaluation): The reported quality improvements lack ablation or sensitivity analysis on the NU-FFT approximation parameters, leaving open whether the gains hold under varying point densities or when error is controlled to machine precision, which directly affects the practical validity of the heterogeneous 3D claims.

    Authors: We acknowledge that the current empirical section does not include sensitivity analysis on the NU-FFT hyperparameters. To close this gap, the revised §5 will incorporate additional ablation studies that (i) vary the oversampling factor, kernel support, and grid size across the portfolio of 3D datasets (including point clouds with heterogeneous densities), (ii) report performance when the NU-FFT approximation error is driven to near machine precision, and (iii) compare against the default parameters used in the main results. These experiments will demonstrate that the observed quality gains are robust and not artifacts of particular approximation settings, thereby reinforcing the practical validity of the heterogeneous 3D claims. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on external NU-FFT theory for O(L log L) complexity

Full rationale

The paper defines RelFlexformers via application of integrable modulation functions f through NU-FFT to achieve O(L log L) attention for arbitrary 3D point distributions. This builds directly on established non-uniform FFT properties (external to the paper) rather than re-deriving or fitting them from the model's own outputs. No equations reduce a claimed prediction to a fitted parameter by construction, no self-citation forms the load-bearing uniqueness argument, and the generalization from grid RPEs is presented as an extension using prior NU-FFT results. The abstract and description contain no self-referential definitions or renamings that collapse the central claim.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the integrability of the modulation functions and the applicability of NU-FFT to arbitrary 3D position distributions; no free parameters or invented entities are mentioned in the abstract.

axioms (2)
  • domain assumption Modulation functions f are integrable
    Explicitly required for the universal 3D RPE method described in the abstract.
  • domain assumption NU-FFT can be applied to arbitrarily distributed token positions in 3D without loss of the stated complexity
    Central to generalizing from grid-based to heterogeneous scenarios.

pith-pipeline@v0.9.0 · 5499 in / 1341 out tokens · 39758 ms · 2026-05-12T05:21:30.596564+00:00 · methodology


Reference graph

Works this paper leans on

111 extracted references · 111 canonical work pages · 2 internal anchors

  1. [1]

    3d semantic parsing of large-scale indoor spaces

    Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1534–1543, 2016

  2. [2]

    Point convolutional neural networks by extension operators

    Matan Atzmon, Haggai Maron, and Yaron Lipman. Point convolutional neural networks by extension operators. arXiv preprint arXiv:1803.10091, 2018

  3. [3]

    Multimae: Multi-modal multi-task masked autoencoders

    Roman Bachmann, David Mizrahi, Andrei Atanov, and Amir Zamir. Multimae: Multi-modal multi-task masked autoencoders. In European conference on computer vision, pages 348–367. Springer, 2022

  4. [4]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv:2004.05150, 2020

  5. [5]

    nuscenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

  6. [6]

    Shapeconv: Shape-aware convolutional layer for indoor rgb-d semantic segmentation

    Jinming Cao, Hanchao Leng, Dani Lischinski, Daniel Cohen-Or, Changhe Tu, and Yangyan Li. Shapeconv: Shape-aware convolutional layer for indoor rgb-d semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7088–7097, 2021

  7. [7]

    Spatial information guided convolution for real-time rgbd semantic segmentation

    Lin-Zhuo Chen, Zheng Lin, Ziqin Wang, Yong-Liang Yang, and Ming-Ming Cheng. Spatial information guided convolution for real-time rgbd semantic segmentation. IEEE Transactions on Image Processing, 30:2313–2324, 2021

  8. [8]

    Bi-directional cross-modality feature propagation with separation-and-aggregation gate for rgb-d semantic segmentation

    Xiaokang Chen, Kwan-Yee Lin, Jingbo Wang, Wayne Wu, Chen Qian, Hongsheng Li, and Gang Zeng. Bi-directional cross-modality feature propagation with separation-and-aggregation gate for rgb-d semantic segmentation. In European conference on computer vision, pages 561–577. Springer, 2020

  9. [9]

    Scaling up kernels in 3d cnns

    Yukang Chen, Jianhui Liu, Xiaojuan Qi, Xiangyu Zhang, Jian Sun, and Jiaya Jia. Scaling up kernels in 3d cnns. arXiv preprint arXiv:2206.10555, 1(2):5, 2022

  10. [10]

    A unified point-based framework for 3d segmentation

    Hung-Yueh Chiang, Yen-Liang Lin, Yueh-Cheng Liu, and Winston H Hsu. A unified point-based framework for 3d segmentation. In 2019 International Conference on 3D Vision (3DV), pages 155–163. IEEE, 2019

  11. [11]

    From block-toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked transformers

    Krzysztof Choromanski, Han Lin, Haoxian Chen, Tianyi Zhang, Arijit Sehanobish, Valerii Likhosherstov, Jack Parker-Holder, Tamás Sarlós, Adrian Weller, and Thomas Weingarten. From block-toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked transformers. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Sz...

  12. [12]

    Rethinking attention with performers

    Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In International Conference on Learning Representations, 2021

  13. [13]

    Fast tree-field integrators: From low displacement rank to topological transformers

    Krzysztof Marcin Choromanski, Arijit Sehanobish, Somnath Basu Roy Chowdhury, Han Lin, Kumar Avinava Dubey, Tamas Sarlos, and Snigdha Chaturvedi. Fast tree-field integrators: From low displacement rank to topological transformers. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  14. [14]

    4d spatio-temporal convnets: Minkowski convolutional neural networks

    Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3075–3084, 2019

  15. [15]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017

  16. [16]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  17. [17]

    Fast kernel methods: Sobolev, physics-informed, and additive models, 2025

    Nathan Doumèche, Francis Bach, Gérard Biau, and Claire Boyer. Fast kernel methods: Sobolev, physics-informed, and additive models, 2025

  18. [18]

    Asymformer: Asymmetrical cross-modal representation learning for mobile platform real-time rgb-d semantic segmentation

    Siqi Du, Weixi Wang, Renzhong Guo, Ruisheng Wang, and Shengjun Tang. Asymformer: Asymmetrical cross-modal representation learning for mobile platform real-time rgb-d semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7608–7615, 2024

  19. [19]

    Fourier analysis and its applications

    Gerald B Folland. Fourier analysis and its applications, volume 4. American Mathematical Soc., 2009

  20. [20]

    Omnivore: A single model for many visual modalities

    Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens Van Der Maaten, Armand Joulin, and Ishan Misra. Omnivore: A single model for many visual modalities. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16102–16112, 2022

  21. [21]

    3d semantic segmentation with submanifold sparse convolutional networks

    Benjamin Graham, Martin Engelcke, and Laurens Van Der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9224–9232, 2018

  22. [22]

    Accelerating the nonuniform fast fourier transform

    Leslie Greengard and June-Yub Lee. Accelerating the nonuniform fast fourier transform. SIAM Review, 46(3):443–454, 2004

  23. [23]

    Pct: Point cloud transformer

    Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R. Martin, and Shi-Min Hu. Pct: Point cloud transformer. Computational Visual Media, 7(2):187–199, Apr 2021

  24. [24]

    Learning rich features from rgb-d images for object detection and segmentation

    Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. Learning rich features from rgb-d images for object detection and segmentation. In European conference on computer vision, pages 345–360. Springer, 2014

  25. [25]

    Deberta: Decoding-enhanced bert with disentangled attention

    Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations, 2021

  26. [26]

    Point-to-voxel knowledge distillation for lidar semantic segmentation

    Yuenan Hou, Xinge Zhu, Yuexin Ma, Chen Change Loy, and Yikang Li. Point-to-voxel knowledge distillation for lidar semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022

  27. [27]

    Acnet: Attention based network to exploit complementary features for rgbd semantic segmentation

    Xinxin Hu, Kailun Yang, Lei Fei, and Kaiwei Wang. Acnet: Attention based network to exploit complementary features for rgbd semantic segmentation. In 2019 IEEE international conference on image processing (ICIP), pages 1440–1444. IEEE, 2019

  28. [28]

    Fourier position embedding: enhancing attention's periodic extension for length generalization

    Ermo Hua, Che Jiang, Xingtai Lv, Kaiyan Zhang, Youbang Sun, Yuchen Fan, Xuekai Zhu, Biqing Qi, Ning Ding, and Bowen Zhou. Fourier position embedding: enhancing attention's periodic extension for length generalization. In Proceedings of the 42nd International Conference on Machine Learning, ICML'25. JMLR.org, 2025

  29. [29]

    Hierarchical point-edge interaction network for point cloud semantic segmentation

    Li Jiang, Hengshuang Zhao, Shu Liu, Xiaoyong Shen, Chi-Wing Fu, and Jiaya Jia. Hierarchical point-edge interaction network for point cloud semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019

  30. [30]

    Transformers are rnns: fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning, ICML'20. JMLR.org, 2020

  31. [31]

    Reformer: The efficient transformer

    Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In International Conference on Learning Representations, 2020

  32. [32]

    Rethinking range view representation for lidar segmentation

    Lingdong Kong, Youquan Liu, Runnan Chen, Yuexin Ma, Xinge Zhu, Yikang Li, Yuenan Hou, Yu Qiao, and Ziwei Liu. Rethinking range view representation for lidar segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

  33. [33]

    Spherical transformer for lidar-based 3d recognition

    Xin Lai, Yukang Chen, Fanbin Lu, Jianhui Liu, and Jiaya Jia. Spherical transformer for lidar-based 3d recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023

  34. [34]

    Stratified transformer for 3d point cloud segmentation

    Xin Lai, Jianhui Liu, Li Jiang, Liwei Wang, Hengshuang Zhao, Shu Liu, Xiaojuan Qi, and Jiaya Jia. Stratified transformer for 3d point cloud segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8500–8509, 2022

  35. [35]

    Large-scale point cloud semantic segmentation with superpoint graphs

    Loic Landrieu and Martin Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018

  36. [36]

    Pointgrid: A deep network for 3d shape understanding

    Truc Le and Ye Duan. Pointgrid: A deep network for 3d shape understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9204–9214, 2018

  37. [37]

    Seggcn: Efficient 3d point cloud segmentation with fuzzy spherical kernel

    Huan Lei, Naveed Akhtar, and Ajmal Mian. Seggcn: Efficient 3d point cloud segmentation with fuzzy spherical kernel. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020

  38. [38]

    Pointcnn: Convolution on x-transformed points

    Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed points. Advances in neural information processing systems, 31, 2018

  39. [39]

    Pamba: enhancing global interaction in point clouds via state space model

    Zhuoyuan Li, Yubo Ai, Jiahao Lu, ChuXin Wang, Jiacheng Deng, Hanzhi Chang, Yanzhe Liang, Wenfei Yang, Shifeng Zhang, and Tianzhu Zhang. Pamba: enhancing global interaction in point clouds via state space model. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5092–5100, 2025

  40. [40]

    Pointmamba: A simple state space model for point cloud analysis

    Dingkang Liang, Xin Zhou, Wei Xu, Xingkui Zhu, Zhikang Zou, Xiaoqing Ye, Xiao Tan, and Xiang Bai. Pointmamba: A simple state space model for point cloud analysis. Advances in neural information processing systems, 37:32653–32677, 2024

  41. [41]

    Meta architecture for point cloud analysis

    Haojia Lin, Xiawu Zheng, Lijiang Li, Fei Chao, Shanshan Wang, Yan Wang, Yonghong Tian, and Rongrong Ji. Meta architecture for point cloud analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17682–17691, 2023

  42. [42]

    Masked discrimination for self-supervised learning on point clouds

    Haotian Liu, Mu Cai, and Yong Jae Lee. Masked discrimination for self-supervised learning on point clouds. In European Conference on Computer Vision, pages 657–675. Springer, 2022

  43. [43]

    Point mamba: A novel point cloud backbone based on state space model with octree-based ordering strategy

    Jiuming Liu, Ruiji Yu, Yian Wang, Yu Zheng, Tianchen Deng, Weicai Ye, and Hesheng Wang. Point mamba: A novel point cloud backbone based on state space model with octree-based ordering strategy. arXiv preprint arXiv:2403.06467, 2024

  44. [44]

    Point2sequence: Learning the shape representation of 3d point clouds with an attention-based sequence to sequence network

    Xinhai Liu, Zhizhong Han, Yu-Shen Liu, and Matthias Zwicker. Point2sequence: Learning the shape representation of 3d point clouds with an attention-based sequence to sequence network. In Proceedings of the AAAI conference on artificial intelligence, pages 8778–8785, 2019

  45. [45]

    Relation-shape convolutional neural network for point cloud analysis

    Yongcheng Liu, Bin Fan, Shiming Xiang, and Chunhong Pan. Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8895–8904, 2019

  46. [46]

    Multi-space alignments towards universal lidar segmentation

    Youquan Liu, Lingdong Kong, Xiaoyang Wu, Runnan Chen, Xin Li, Liang Pan, Ziwei Liu, and Yuexin Ma. Multi-space alignments towards universal lidar segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024

  47. [47]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021

  48. [48]

    Transformers in 3d point clouds: A survey

    Dening Lu, Qian Xie, Mingqiang Wei, Kyle Gao, Linlin Xu, and Jonathan Li. Transformers in 3d point clouds: A survey. arXiv preprint arXiv:2205.07417, 2022

  49. [49]

    Stable, fast and accurate: Kernelized attention with relative positional encoding

    Shengjie Luo, Shanda Li, Tianle Cai, Di He, Dinglan Peng, Shuxin Zheng, Guolin Ke, Liwei Wang, and Tie-Yan Liu. Stable, fast and accurate: Kernelized attention with relative positional encoding. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 22795–...

  50. [50]

    Rethinking network design and local geometry in point cloud: A simple residual mlp framework

    Xu Ma, Can Qin, Haoxuan You, Haoxi Ran, and Yun Fu. Rethinking network design and local geometry in point cloud: A simple residual mlp framework. arXiv preprint arXiv:2202.07123, 2022

  51. [51]

    Indoor segmentation and support inference from rgbd images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012

  52. [52]

    Masked autoencoders for 3d point cloud self-supervised learning

    Yatian Pang, Eng Hock Francis Tay, Li Yuan, and Zhenghua Chen. Masked autoencoders for 3d point cloud self-supervised learning. World Scientific Annual Review of Artificial Intelligence, 1:2440001, 2023

  53. [53]

    Oa-cnns: Omni-adaptive sparse cnns for 3d semantic segmentation

    Bohao Peng, Xiaoyang Wu, Li Jiang, Yukang Chen, Hengshuang Zhao, Zhuotao Tian, and Jiaya Jia. Oa-cnns: Omni-adaptive sparse cnns for 3d semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024

  54. [54]

    Pointcept: A codebase for point cloud perception research

    Pointcept Contributors. Pointcept: A codebase for point cloud perception research. https://github.com/Pointcept/Pointcept, 2023

  55. [55]

    Train short, test long: Attention with linear biases enables input length extrapolation

    Ofir Press, Noah Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations, 2022

  56. [56]

    Pointnet: Deep learning on point sets for 3d classification and segmentation

    Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 652–660, 2017

  57. [57]

    Pointnet++: Deep hierarchical feature learning on point sets in a metric space

    Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30, 2017

  58. [58]

    Guocheng Qian, Yuchen Li, Houwen Peng, Jinjie Mai, Hasan Hammoud, Mohamed Elhoseiny, and Bernard Ghanem. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. Advances in neural information processing systems, 35:23192–23204, 2022

  59. [59]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1), January 2020

  60. [60]

    Isaac Reid, Kumar Avinava Dubey, Deepali Jain, William F Whitney, Amr Ahmed, Joshua Ainslie, Alex Bewley, Mithun George Jacob, Aranyak Mehta, David Rendleman, Connor Schenck, Richard E. Turner, René Wagner, Adrian Weller, and Krzysztof Marcin Choromanski. Linear transformer topological masking with graph random features. In The Thirteenth International Con...

  61. [61]

    Damien Robert, Hugo Raguet, and Loic Landrieu. Efficient 3d semantic segmentation with superpoint transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

  62. [62]

    David Rozenberszki, Or Litany, and Angela Dai. Language-grounded indoor 3d semantic segmentation in the wild. In European conference on computer vision, pages 125–141. Springer, 2022

  63. [63]

    Connor Schenck, Isaac Reid, Mithun George Jacob, Alex Bewley, Joshua Ainslie, David Rendleman, Deepali Jain, Mohit Sharma, Kumar Avinava Dubey, Ayzaan Wahid, Sumeet Singh, René Wagner, Tianli Ding, Chuyuan Fu, Arunkumar Byravan, Jake Varley, Alexey A. Gritsenko, Matthias Minderer, Dmitry Kalashnikov, Jonathan Tompson, Vikas Sindhwani, and Krzysztof Marcin...

  64. [64]

    Daniel Seichter, Söhnke Benedikt Fischedick, Mona Köhler, and Horst-Michael Groß. Efficient multi-task rgb-d scene analysis for indoor environments. In 2022 International joint conference on neural networks (IJCNN), pages 1–10. IEEE, 2022

  65. [65]

    Daniel Seichter, Mona Köhler, Benjamin Lewandowski, Tim Wengefeld, and Horst-Michael Gross. Efficient rgb-d semantic segmentation for indoor scene analysis. In 2021 IEEE international conference on robotics and automation (ICRA), pages 13525–13531. IEEE, 2021

  66. [66]

    Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In Marilyn Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468, New Orleans, Lo...

  67. [67]

    Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 567–576, 2015

  68. [68]

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomput., 568(C), February 2024

  69. [69]

    Haotian Tang, Zhijian Liu, Shengyu Zhao, Yujun Lin, Ji Lin, Hanrui Wang, and Song Han. Searching efficient 3d architectures with sparse point-voxel convolution. In European conference on computer vision, 2020

  70. [70]

    Maxim Tatarchenko, Jaesik Park, Vladlen Koltun, and Qian-Yi Zhou. Tangent convolutions for dense prediction in 3d. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018

  71. [71]

    Lyne Tchapmi, Christopher Choy, Iro Armeni, Jun Young Gwak, and Silvio Savarese. Segcloud: Semantic segmentation of 3d point clouds. In Proceedings of the International Conference on 3D Vision (3DV), 2017

  72. [72]

    Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J Guibas. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6411–6420, 2019

  73. [73]

    Mikaela Angelina Uy, Quang-Hieu Pham, Binh-Son Hua, Thanh Nguyen, and Sai-Kit Yeung. Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1588–1597, 2019

  74. [74]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc

  75. [75]

    Lei Wang, Yuchun Huang, Yaolin Hou, Shenman Zhang, and Jie Shan. Graph attention convolution for point cloud semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019

  76. [76]

    Peng-Shuai Wang. Octformer: Octree-based transformers for 3d point clouds. ACM Transactions on Graphics (TOG), 42(4):1–11, 2023

  77. [77]

    Shenlong Wang, Simon Suo, Wei-Chiu Ma, Andrei Pokrovsky, and Raquel Urtasun. Deep parametric continuous convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018

  78. [78]

    Yikai Wang, Xinghao Chen, Lele Cao, Wenbing Huang, Fuchun Sun, and Yunhe Wang. Multimodal token fusion for vision transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12186–12195, 2022

  79. [79]

    Yikai Wang, Wenbing Huang, Fuchun Sun, Tingyang Xu, Yu Rong, and Junzhou Huang. Deep multimodal fusion by channel exchanging. Advances in neural information processing systems, 33:4835–4845, 2020

  80. [80]

    Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (SIGGRAPH Asia), 38(5), pages 146:1–146:12, 2019
