pith. machine review for the scientific record.

arxiv: 2605.10706 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: no theorem link

RelFlexformer: Efficient Attention 3D-Transformers for Integrable Relative Positional Encodings

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 05:21 UTC · model grok-4.3

classification 💻 cs.LG
keywords RelFlexformer · 3D Transformer · relative positional encoding · non-uniform Fourier transform · point clouds · efficient attention · 3D data modeling

The pith

RelFlexformers integrate arbitrary integrable relative positional encodings via non-uniform Fourier transforms to deliver O(L log L) attention for 3D points at irregular locations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RelFlexformers as a new family of 3D transformers that apply relative positional encodings defined by any integrable modulation function. These encodings are realized through the non-uniform fast Fourier transform, which keeps attention computation at O(L log L) cost even when input tokens sit at completely arbitrary positions in 3D space. The approach matters because many 3D datasets, especially point clouds, lack the regular grid structure that earlier efficient RPE methods required. By removing that structural constraint, the models generalize prior techniques while preserving or improving accuracy on downstream 3D tasks.
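The mechanism described above can be reconstructed in miniature: a quadrature of the modulation function's Fourier transform turns the RPE mask M_ij = f(x_i − x_j) into a short sum of rank-one phase terms, so masked attention never materializes the L×L mask, and the NU-FFT then accelerates the remaining phase sums. The numpy sketch below illustrates only this factorization (1D positions, a Gaussian modulation, direct phase sums in place of an actual NU-FFT; all sizes are hypothetical), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, S = 64, 8, 64  # tokens, head dim, quadrature size (all hypothetical)

# Arbitrary (non-grid) 1D token positions; the paper's setting is 3D.
x = rng.uniform(-1.0, 1.0, L)
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))

# Fourier quadrature of the integrable modulation f(r) = exp(-r^2 / 2):
# f(r) = (1/2pi) * int fhat(xi) e^{i xi r} dxi ~= sum_s c_s e^{i xi_s r},
# where fhat (the Fourier transform of f) is itself Gaussian.
xi = np.linspace(-20.0, 20.0, S)
c = (xi[1] - xi[0]) / (2 * np.pi) * np.sqrt(2 * np.pi) * np.exp(-xi**2 / 2)

# Direct O(L^2) reference: mask-modulated (unnormalized) attention.
M = np.exp(-((x[:, None] - x[None, :]) ** 2) / 2)
out_direct = (M * (Q @ K.T)) @ V

# Factorized form: M_ij ~= sum_s c_s e^{i xi_s x_i} e^{-i xi_s x_j}, so each
# quadrature term evaluates right-to-left without ever forming the L x L mask.
# (An NU-FFT would additionally batch the S phase sums, giving O(L log L).)
E = np.exp(1j * np.outer(x, xi))                  # L x S phase factors
out_fact = np.zeros((L, d), dtype=complex)
for s in range(S):
    KsT_V = (K * np.conj(E[:, s:s + 1])).T @ V    # d x d intermediate
    out_fact += c[s] * (Q * E[:, s:s + 1]) @ KsT_V
out_fact = out_fact.real

print(np.max(np.abs(out_direct - out_fact)))  # tiny quadrature error
```

The per-term cost here is O(L d²) across S quadrature points; the claimed O(L log L) comes from replacing the explicit phase matrix E with a non-uniform FFT over the quadrature grid.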

Core claim

RelFlexformers are 3D-Transformer models that flexibly integrate universal 3D Relative Positional Encoding methods given by arbitrary integrable modulation functions f, achieve O(L log L) attention complexity by leveraging the Non-Uniform Fourier Transform, and generalize existing RPE-attention techniques from homogeneous grid settings to arbitrary heterogeneous 3D position distributions such as point clouds.

What carries the argument

The Non-Uniform Fourier Transform (NU-FFT) applied to integrable modulation functions, which modulates attention scores for tokens at arbitrarily distributed 3D coordinates.

If this is right

  • Point-cloud modeling becomes directly feasible without forcing tokens onto uniform grids.
  • Attention for long 3D sequences remains practical because cost grows as O(L log L) rather than O(L²).
  • Existing RPE methods for grid data extend automatically to heterogeneous 3D layouts.
  • Empirical tests on multiple 3D datasets show measurable quality gains from the flexible encodings.
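To put the complexity bullet in perspective, a back-of-envelope operation count at a hypothetical sequence length (constants and the quadrature size omitted, so this is an asymptotic illustration only):

```python
import math

L = 100_000                                 # hypothetical point-cloud size
quadratic = L * L                           # O(L^2) pairwise score count
linearithmic = L * math.ceil(math.log2(L))  # O(L log L), constants ignored

print(quadratic // linearithmic)  # → 5882
```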

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same NU-FFT construction could be tested in 2D or 4D settings to check whether the efficiency pattern generalizes beyond three dimensions.
  • Hybrid models that combine RelFlexformer blocks with other fast attention primitives might further reduce constants in practice.
  • Applications such as 3D scene reconstruction or molecular modeling could benefit from the removal of grid assumptions.
  • Error bounds on the NU-FFT approximation for specific families of modulation functions remain open for tighter analysis.

Load-bearing premise

Integrable modulation functions exist that can be evaluated via NU-FFT on arbitrary 3D positions without extra structure or approximation errors large enough to erase the claimed speed or quality gains.
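This premise is checkable for concrete modulation functions. A minimal numpy sketch, assuming a 1D Gaussian modulation and a plain trapezoid Fourier quadrature standing in for the NU-FFT's gridding step, shows the approximation error collapsing as the quadrature size S grows, i.e. accuracy can be driven down without changing the asymptotics:

```python
import numpy as np

def quad_error(S, half_width=20.0, r_max=2.0):
    """Max pointwise error of an S-point trapezoid Fourier quadrature
    for the Gaussian modulation f(r) = exp(-r^2 / 2)."""
    xi = np.linspace(-half_width, half_width, S)
    # Quadrature weights times fhat(xi); fhat of a Gaussian is Gaussian.
    c = (xi[1] - xi[0]) / (2 * np.pi) * np.sqrt(2 * np.pi) * np.exp(-xi**2 / 2)
    r = np.linspace(-r_max, r_max, 201)
    approx = (c * np.exp(1j * np.outer(r, xi))).sum(axis=1).real
    return float(np.max(np.abs(approx - np.exp(-r**2 / 2))))

for S in (16, 32, 64):
    print(S, quad_error(S))  # error drops sharply as S grows
```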

What would settle it

Measure wall-clock attention time and downstream accuracy on a large irregular point-cloud dataset while scaling sequence length L; the claim holds if runtime stays linearithmic and accuracy does not drop below standard quadratic-attention baselines.

Figures

Figures reproduced from arXiv: 2605.10706 by Arijit Sehanobish, Avinava Dubey, Byeongchan Kim, Krzysztof Choromanski, Min-hwan Oh.

Figure 1. Execution time (ms) as a function of sequence length…
Figure 2. RelFlexformer Mask Behavior Analysis. Comparison of our proposed kernels against standard RBF and Laplace kernels. As the Euclidean distance between points increases, the kernel value decays smoothly, mimicking the behavior of Heat and Laplace kernels (dashed lines). The tight variance in the scatter highlights the stability of our projection method across varying spatial scales. Substituting this into the…
Figure 3. Effect of the quadrature size S per head across different datasets. The plots illustrate the performance trade-offs on NYU Depth v2 (left, mIoU%), ScanObjectNN v2 (middle, Overall Accuracy%), and SUN RGB-D (right, mIoU%) for λ = 2. The RPE mask M is blended from identity to full strength over the first N epochs via M̂ = 1 + α(M − 1), α = 1/2.
Original abstract

We present a new class of efficient attention mechanisms applying universal 3D Relative Positional Encoding (RPE) methods given by arbitrary integrable modulation functions $f$. They lead to the new class of 3D-Transformer models, called \textit{RelFlexformers}, flexibly integrating those RPEs, and characterized by the $O(L \log L)$ time complexity of the attention computation for the $L$-length input sequences. RelFlexformers builds on the theory of the Non-Uniform Fourier Transform (NU-FFT), naturally generalizing several existing efficient RPE-attention methods from structured settings with tokens homogeneously embedded in unweighted grids into general non-structured heterogeneous scenarios, where tokens' positions are arbitrarily distributed in the corresponding 3D spaces. As such, RelFlexformers can be applied in particular to model point clouds. Our extensive empirical evaluation on a large portfolio of 3D datasets confirms quality improvements provided by the NU-FFT-driven attention modulation techniques in the RelFlexformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces RelFlexformers, a new class of 3D-Transformer models that integrate arbitrary integrable relative positional encodings (RPEs) defined by modulation functions f via the Non-Uniform Fast Fourier Transform (NU-FFT). This yields O(L log L) attention complexity for L-length sequences with arbitrarily distributed 3D token positions, generalizing prior efficient RPE methods from homogeneous grid settings to heterogeneous unstructured point clouds, with empirical quality gains demonstrated on a portfolio of 3D datasets.

Significance. If the NU-FFT approximation errors can be rigorously bounded to preserve both the stated complexity and performance, the work would meaningfully advance efficient attention for unstructured 3D data by unifying and extending grid-based RPE techniques into a flexible framework applicable to point clouds. The manuscript earns credit for grounding the approach in established NU-FFT theory and for conducting extensive empirical evaluation across multiple 3D datasets.

major comments (2)
  1. [§3] §3 (NU-FFT-based RPE integration): The central claim of exact O(L log L) complexity for arbitrary 3D position distributions rests on representing integrable modulation functions f via NU-FFT, yet the manuscript provides no explicit error bounds or analysis for the approximation parameters (oversampling factor, kernel support, grid size) specific to these f; this is load-bearing for the generalization from structured grids to heterogeneous scenarios and for the claimed complexity-quality tradeoff.
  2. [§5] §5 (Empirical evaluation): The reported quality improvements lack ablation or sensitivity analysis on the NU-FFT approximation parameters, leaving open whether the gains hold under varying point densities or when error is controlled to machine precision, which directly affects the practical validity of the heterogeneous 3D claims.
minor comments (3)
  1. [§2] The definition and integrability conditions on the modulation function f are introduced late; an explicit early statement with an example would improve readability.
  2. [Figure 2] Figure 2 (attention visualization): The caption does not specify the exact NU-FFT parameters used, making it difficult to reproduce the depicted modulation effects.
  3. [§1] Several citations to prior grid-based RPE methods (e.g., in §1) could be expanded with direct complexity comparisons to highlight the precise generalization achieved.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The two major comments highlight important aspects of rigor in the theoretical claims and empirical validation. We address each point below and outline revisions that will strengthen the manuscript while preserving its core contributions on flexible 3D RPE via NU-FFT.

Point-by-point responses
  1. Referee: [§3] §3 (NU-FFT-based RPE integration): The central claim of exact O(L log L) complexity for arbitrary 3D position distributions rests on representing integrable modulation functions f via NU-FFT, yet the manuscript provides no explicit error bounds or analysis for the approximation parameters (oversampling factor, kernel support, grid size) specific to these f; this is load-bearing for the generalization from structured grids to heterogeneous scenarios and for the claimed complexity-quality tradeoff.

    Authors: We agree that a dedicated error analysis is necessary to fully support the generalization to heterogeneous 3D positions. The NU-FFT framework (as established in the literature, e.g., Greengard et al.) provides general error bounds controlled by the oversampling factor, kernel support width, and grid resolution, which can be made arbitrarily small for integrable f without changing the asymptotic O(L log L) complexity. However, the manuscript does not tailor these bounds explicitly to the modulation functions f used in RelFlexformers. In the revision, we will add a new subsection in §3 that (i) recalls the relevant NU-FFT error theorems, (ii) derives parameter-dependent bounds specific to the integrable f considered (including examples for common RPE kernels), and (iii) discusses how the approximation error trades off against the claimed complexity for unstructured point clouds. This will make the load-bearing assumptions explicit. revision: yes

  2. Referee: [§5] §5 (Empirical evaluation): The reported quality improvements lack ablation or sensitivity analysis on the NU-FFT approximation parameters, leaving open whether the gains hold under varying point densities or when error is controlled to machine precision, which directly affects the practical validity of the heterogeneous 3D claims.

    Authors: We acknowledge that the current empirical section does not include sensitivity analysis on the NU-FFT hyperparameters. To close this gap, the revised §5 will incorporate additional ablation studies that (i) vary the oversampling factor, kernel support, and grid size across the portfolio of 3D datasets (including point clouds with heterogeneous densities), (ii) report performance when the NU-FFT approximation error is driven to near machine precision, and (iii) compare against the default parameters used in the main results. These experiments will demonstrate that the observed quality gains are robust and not artifacts of particular approximation settings, thereby reinforcing the practical validity of the heterogeneous 3D claims. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on external NU-FFT theory for O(L log L) complexity

Full rationale

The paper defines RelFlexformers via application of integrable modulation functions f through NU-FFT to achieve O(L log L) attention for arbitrary 3D point distributions. This builds directly on established non-uniform FFT properties (external to the paper) rather than re-deriving or fitting them from the model's own outputs. No equations reduce a claimed prediction to a fitted parameter by construction, no self-citation forms the load-bearing uniqueness argument, and the generalization from grid RPEs is presented as an extension using prior NU-FFT results. The abstract and description contain no self-referential definitions or renamings that collapse the central claim.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the integrability of the modulation functions and the applicability of NU-FFT to arbitrary 3D position distributions; no free parameters or invented entities are mentioned in the abstract.

axioms (2)
  • domain assumption Modulation functions f are integrable
    Explicitly required for the universal 3D RPE method described in the abstract.
  • domain assumption NU-FFT can be applied to arbitrarily distributed token positions in 3D without loss of the stated complexity
    Central to generalizing from grid-based to heterogeneous scenarios.

pith-pipeline@v0.9.0 · 5499 in / 1341 out tokens · 39758 ms · 2026-05-12T05:21:30.596564+00:00 · methodology


Reference graph

Works this paper leans on

111 extracted references · 111 canonical work pages · 2 internal anchors

  1. [1]

    3d semantic parsing of large-scale indoor spaces

    Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1534–1543, 2016

  2. [2]

    Point convolutional neural networks by extension operators

    Matan Atzmon, Haggai Maron, and Yaron Lipman. Point convolutional neural networks by extension operators. arXiv preprint arXiv:1803.10091, 2018

  3. [3]

    Multimae: Multi-modal multi-task masked autoencoders

    Roman Bachmann, David Mizrahi, Andrei Atanov, and Amir Zamir. Multimae: Multi-modal multi-task masked autoencoders. In European conference on computer vision, pages 348–367. Springer, 2022

  4. [4]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv:2004.05150, 2020

  5. [5]

    nuscenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

  6. [6]

    Shapeconv: Shape-aware convolutional layer for indoor rgb-d semantic segmentation

    Jinming Cao, Hanchao Leng, Dani Lischinski, Daniel Cohen-Or, Changhe Tu, and Yangyan Li. Shapeconv: Shape-aware convolutional layer for indoor rgb-d semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7088–7097, 2021

  7. [7]

    Spatial information guided convolution for real-time rgbd semantic segmentation

    Lin-Zhuo Chen, Zheng Lin, Ziqin Wang, Yong-Liang Yang, and Ming-Ming Cheng. Spatial information guided convolution for real-time rgbd semantic segmentation. IEEE Transactions on Image Processing, 30:2313–2324, 2021

  8. [8]

    Bi-directional cross-modality feature propagation with separation-and-aggregation gate for rgb-d semantic segmentation

    Xiaokang Chen, Kwan-Yee Lin, Jingbo Wang, Wayne Wu, Chen Qian, Hongsheng Li, and Gang Zeng. Bi-directional cross-modality feature propagation with separation-and-aggregation gate for rgb-d semantic segmentation. In European conference on computer vision, pages 561–577. Springer, 2020

  9. [9]

    Scaling up kernels in 3d cnns

    Yukang Chen, Jianhui Liu, Xiaojuan Qi, Xiangyu Zhang, Jian Sun, and Jiaya Jia. Scaling up kernels in 3d cnns. arXiv preprint arXiv:2206.10555, 1(2):5, 2022

  10. [10]

    A unified point-based framework for 3d segmentation

    Hung-Yueh Chiang, Yen-Liang Lin, Yueh-Cheng Liu, and Winston H Hsu. A unified point-based framework for 3d segmentation. In 2019 International Conference on 3D Vision (3DV), pages 155–163. IEEE, 2019

  11. [11]

    From block-toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked transformers

    Krzysztof Choromanski, Han Lin, Haoxian Chen, Tianyi Zhang, Arijit Sehanobish, Valerii Likhosherstov, Jack Parker-Holder, Tamás Sarlós, Adrian Weller, and Thomas Weingarten. From block-toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked transformers. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Sz...

  12. [12]

    Rethinking attention with performers

    Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In International Conference on Learning Representations, 2021

  13. [13]

    Fast tree-field integrators: From low displacement rank to topological transformers

    Krzysztof Marcin Choromanski, Arijit Sehanobish, Somnath Basu Roy Chowdhury, Han Lin, Kumar Avinava Dubey, Tamas Sarlos, and Snigdha Chaturvedi. Fast tree-field integrators: From low displacement rank to topological transformers. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  14. [14]

    4d spatio-temporal convnets: Minkowski convolutional neural networks

    Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3075–3084, 2019

  15. [15]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017

  16. [16]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  17. [17]

    Fast kernel methods: Sobolev, physics-informed, and additive models, 2025

    Nathan Doumèche, Francis Bach, Gérard Biau, and Claire Boyer. Fast kernel methods: Sobolev, physics-informed, and additive models, 2025

  18. [18]

    Asymformer: Asymmetrical cross-modal representation learning for mobile platform real-time rgb-d semantic segmentation

    Siqi Du, Weixi Wang, Renzhong Guo, Ruisheng Wang, and Shengjun Tang. Asymformer: Asymmetrical cross-modal representation learning for mobile platform real-time rgb-d semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7608–7615, 2024

  19. [19]

    Fourier analysis and its applications

    Gerald B Folland. Fourier analysis and its applications, volume 4. American Mathematical Soc., 2009

  20. [20]

    Omnivore: A single model for many visual modalities

    Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens Van Der Maaten, Armand Joulin, and Ishan Misra. Omnivore: A single model for many visual modalities. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16102–16112, 2022

  21. [21]

    3d semantic segmentation with submanifold sparse convolutional networks

    Benjamin Graham, Martin Engelcke, and Laurens Van Der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9224–9232, 2018

  22. [22]

    Accelerating the nonuniform fast fourier transform

    Leslie Greengard and June-Yub Lee. Accelerating the nonuniform fast fourier transform. SIAM Review, 46(3):443–454, 2004

  23. [23]

    Pct: Point cloud transformer

    Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R. Martin, and Shi-Min Hu. Pct: Point cloud transformer. Computational Visual Media, 7(2):187–199, Apr 2021

  24. [24]

    Learning rich features from rgb-d images for object detection and segmentation

    Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. Learning rich features from rgb-d images for object detection and segmentation. In European conference on computer vision, pages 345–360. Springer, 2014

  25. [25]

    Deberta: Decoding-enhanced bert with disentangled attention

    Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations, 2021

  26. [26]

    Point-to-voxel knowledge distillation for lidar semantic segmentation

    Yuenan Hou, Xinge Zhu, Yuexin Ma, Chen Change Loy, and Yikang Li. Point-to-voxel knowledge distillation for lidar semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022

  27. [27]

    Acnet: Attention based network to exploit complementary features for rgbd semantic segmentation

    Xinxin Hu, Kailun Yang, Lei Fei, and Kaiwei Wang. Acnet: Attention based network to exploit complementary features for rgbd semantic segmentation. In 2019 IEEE international conference on image processing (ICIP), pages 1440–1444. IEEE, 2019

  28. [28]

    Fourier position embedding: enhancing attention's periodic extension for length generalization

    Ermo Hua, Che Jiang, Xingtai Lv, Kaiyan Zhang, Youbang Sun, Yuchen Fan, Xuekai Zhu, Biqing Qi, Ning Ding, and Bowen Zhou. Fourier position embedding: enhancing attention's periodic extension for length generalization. In Proceedings of the 42nd International Conference on Machine Learning, ICML'25. JMLR.org, 2025

  29. [29]

    Hierarchical point-edge interaction network for point cloud semantic segmentation

    Li Jiang, Hengshuang Zhao, Shu Liu, Xiaoyong Shen, Chi-Wing Fu, and Jiaya Jia. Hierarchical point-edge interaction network for point cloud semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019

  30. [30]

    Transformers are rnns: fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning, ICML'20. JMLR.org, 2020

  31. [31]

    Reformer: The efficient transformer

    Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In International Conference on Learning Representations, 2020

  32. [32]

    Rethinking range view representation for lidar segmentation

    Lingdong Kong, Youquan Liu, Runnan Chen, Yuexin Ma, Xinge Zhu, Yikang Li, Yuenan Hou, Yu Qiao, and Ziwei Liu. Rethinking range view representation for lidar segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

  33. [33]

    Spherical transformer for lidar-based 3d recognition

    Xin Lai, Yukang Chen, Fanbin Lu, Jianhui Liu, and Jiaya Jia. Spherical transformer for lidar-based 3d recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023

  34. [34]

    Stratified transformer for 3d point cloud segmentation

    Xin Lai, Jianhui Liu, Li Jiang, Liwei Wang, Hengshuang Zhao, Shu Liu, Xiaojuan Qi, and Jiaya Jia. Stratified transformer for 3d point cloud segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8500–8509, 2022

  35. [35]

    Large-scale point cloud semantic segmentation with superpoint graphs

    Loic Landrieu and Martin Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018

  36. [36]

    Pointgrid: A deep network for 3d shape understanding

    Truc Le and Ye Duan. Pointgrid: A deep network for 3d shape understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9204–9214, 2018

  37. [37]

    Seggcn: Efficient 3d point cloud segmentation with fuzzy spherical kernel

    Huan Lei, Naveed Akhtar, and Ajmal Mian. Seggcn: Efficient 3d point cloud segmentation with fuzzy spherical kernel. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020

  38. [38]

    Pointcnn: Convolution on x-transformed points

    Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed points. Advances in neural information processing systems, 31, 2018

  39. [39]

    Pamba: enhancing global interaction in point clouds via state space model

    Zhuoyuan Li, Yubo Ai, Jiahao Lu, ChuXin Wang, Jiacheng Deng, Hanzhi Chang, Yanzhe Liang, Wenfei Yang, Shifeng Zhang, and Tianzhu Zhang. Pamba: enhancing global interaction in point clouds via state space model. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5092–5100, 2025

  40. [40]

    Pointmamba: A simple state space model for point cloud analysis

    Dingkang Liang, Xin Zhou, Wei Xu, Xingkui Zhu, Zhikang Zou, Xiaoqing Ye, Xiao Tan, and Xiang Bai. Pointmamba: A simple state space model for point cloud analysis. Advances in neural information processing systems, 37:32653–32677, 2024

  41. [41]

    Meta architecture for point cloud analysis

    Haojia Lin, Xiawu Zheng, Lijiang Li, Fei Chao, Shanshan Wang, Yan Wang, Yonghong Tian, and Rongrong Ji. Meta architecture for point cloud analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17682–17691, 2023

  42. [42]

    Masked discrimination for self-supervised learning on point clouds

    Haotian Liu, Mu Cai, and Yong Jae Lee. Masked discrimination for self-supervised learning on point clouds. In European Conference on Computer Vision, pages 657–675. Springer, 2022

  43. [43]

    Point mamba: A novel point cloud backbone based on state space model with octree-based ordering strategy

    Jiuming Liu, Ruiji Yu, Yian Wang, Yu Zheng, Tianchen Deng, Weicai Ye, and Hesheng Wang. Point mamba: A novel point cloud backbone based on state space model with octree-based ordering strategy. arXiv preprint arXiv:2403.06467, 2024

  44. [44]

    Point2sequence: Learning the shape representation of 3d point clouds with an attention-based sequence to sequence network

    Xinhai Liu, Zhizhong Han, Yu-Shen Liu, and Matthias Zwicker. Point2sequence: Learning the shape representation of 3d point clouds with an attention-based sequence to sequence network. In Proceedings of the AAAI conference on artificial intelligence, pages 8778–8785, 2019

  45. [45]

    Relation-shape convolutional neural network for point cloud analysis

    Yongcheng Liu, Bin Fan, Shiming Xiang, and Chunhong Pan. Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8895–8904, 2019

  46. [46]

    Multi-space alignments towards universal lidar segmentation

    Youquan Liu, Lingdong Kong, Xiaoyang Wu, Runnan Chen, Xin Li, Liang Pan, Ziwei Liu, and Yuexin Ma. Multi-space alignments towards universal lidar segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024

  47. [47]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021

  48. [48]

    Transformers in 3d point clouds: A survey

    Dening Lu, Qian Xie, Mingqiang Wei, Kyle Gao, Linlin Xu, and Jonathan Li. Transformers in 3d point clouds: A survey. arXiv preprint arXiv:2205.07417, 2022

  49. [49]

    Stable, fast and accurate: Kernelized attention with relative positional encoding

    Shengjie Luo, Shanda Li, Tianle Cai, Di He, Dinglan Peng, Shuxin Zheng, Guolin Ke, Liwei Wang, and Tie-Yan Liu. Stable, fast and accurate: Kernelized attention with relative positional encoding. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 22795–...

  50. [50]

    Rethinking network design and local geometry in point cloud: A simple residual mlp framework

    Xu Ma, Can Qin, Haoxuan You, Haoxi Ran, and Yun Fu. Rethinking network design and local geometry in point cloud: A simple residual mlp framework. arXiv preprint arXiv:2202.07123, 2022

  51. [51]

    Indoor segmentation and support inference from rgbd images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012

  52. [52]

    Masked autoencoders for 3d point cloud self-supervised learning

    Yatian Pang, Eng Hock Francis Tay, Li Yuan, and Zhenghua Chen. Masked autoencoders for 3d point cloud self-supervised learning. World Scientific Annual Review of Artificial Intelligence, 1:2440001, 2023

  53. [53]

    Oa-cnns: Omni-adaptive sparse cnns for 3d semantic segmentation

    Bohao Peng, Xiaoyang Wu, Li Jiang, Yukang Chen, Hengshuang Zhao, Zhuotao Tian, and Jiaya Jia. Oa-cnns: Omni-adaptive sparse cnns for 3d semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024

  54. [54]

    Pointcept: A codebase for point cloud perception research

    Pointcept Contributors. Pointcept: A codebase for point cloud perception research. https://github.com/Pointcept/Pointcept, 2023

  55. [55]

    Train short, test long: Attention with linear biases enables input length extrapolation

    Ofir Press, Noah Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations, 2022

  56. [56]

    Pointnet: Deep learning on point sets for 3d classification and segmentation

    Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 652–660, 2017

  57. [57]

    Pointnet++: Deep hierarchical feature learning on point sets in a metric space

    Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30, 2017

  58. [58]

    Guocheng Qian, Yuchen Li, Houwen Peng, Jinjie Mai, Hasan Hammoud, Mohamed Elhoseiny, and Bernard Ghanem. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. Advances in neural information processing systems, 35:23192–23204, 2022

  59. [59]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1), January 2020

  60. [60]

    Isaac Reid, Kumar Avinava Dubey, Deepali Jain, William F Whitney, Amr Ahmed, Joshua Ainslie, Alex Bewley, Mithun George Jacob, Aranyak Mehta, David Rendleman, Connor Schenck, Richard E. Turner, René Wagner, Adrian Weller, and Krzysztof Marcin Choromanski. Linear transformer topological masking with graph random features. In The Thirteenth International Con...

  61. [61]

    Damien Robert, Hugo Raguet, and Loic Landrieu. Efficient 3d semantic segmentation with superpoint transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

  62. [62]

    David Rozenberszki, Or Litany, and Angela Dai. Language-grounded indoor 3d semantic segmentation in the wild. In European conference on computer vision, pages 125–141. Springer, 2022

  63. [63]

    Connor Schenck, Isaac Reid, Mithun George Jacob, Alex Bewley, Joshua Ainslie, David Rendleman, Deepali Jain, Mohit Sharma, Kumar Avinava Dubey, Ayzaan Wahid, Sumeet Singh, René Wagner, Tianli Ding, Chuyuan Fu, Arunkumar Byravan, Jake Varley, Alexey A. Gritsenko, Matthias Minderer, Dmitry Kalashnikov, Jonathan Tompson, Vikas Sindhwani, and Krzysztof Marcin...

  64. [64]

    Daniel Seichter, Söhnke Benedikt Fischedick, Mona Köhler, and Horst-Michael Groß. Efficient multi-task rgb-d scene analysis for indoor environments. In 2022 International joint conference on neural networks (IJCNN), pages 1–10. IEEE, 2022

  65. [65]

    Daniel Seichter, Mona Köhler, Benjamin Lewandowski, Tim Wengefeld, and Horst-Michael Gross. Efficient rgb-d semantic segmentation for indoor scene analysis. In 2021 IEEE international conference on robotics and automation (ICRA), pages 13525–13531. IEEE, 2021

  66. [66]

    Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In Marilyn Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468, New Orleans, Lo...

  67. [67]

    Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 567–576, 2015

  68. [68]

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomput., 568(C), February 2024

  69. [69]

    Haotian Tang, Zhijian Liu, Shengyu Zhao, Yujun Lin, Ji Lin, Hanrui Wang, and Song Han. Searching efficient 3d architectures with sparse point-voxel convolution. In European conference on computer vision, 2020

  70. [70]

    Maxim Tatarchenko, Jaesik Park, Vladlen Koltun, and Qian-Yi Zhou. Tangent convolutions for dense prediction in 3d. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018

  71. [71]

    Lyne Tchapmi, Christopher Choy, Iro Armeni, Jun Young Gwak, and Silvio Savarese. Segcloud: Semantic segmentation of 3d point clouds. In Proceedings of the International Conference on 3D Vision (3DV), 2017

  72. [72]

    Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J Guibas. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6411–6420, 2019

  73. [73]

    Mikaela Angelina Uy, Quang-Hieu Pham, Binh-Son Hua, Thanh Nguyen, and Sai-Kit Yeung. Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1588–1597, 2019

  74. [74]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc

  75. [75]

    Lei Wang, Yuchun Huang, Yaolin Hou, Shenman Zhang, and Jie Shan. Graph attention convolution for point cloud semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019

  76. [76]

    Peng-Shuai Wang. Octformer: Octree-based transformers for 3d point clouds. ACM Transactions on Graphics (TOG), 42(4):1–11, 2023

  77. [77]

    Shenlong Wang, Simon Suo, Wei-Chiu Ma, Andrei Pokrovsky, and Raquel Urtasun. Deep parametric continuous convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018

  78. [78]

    Yikai Wang, Xinghao Chen, Lele Cao, Wenbing Huang, Fuchun Sun, and Yunhe Wang. Multimodal token fusion for vision transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12186–12195, 2022

  79. [79]

    Yikai Wang, Wenbing Huang, Fuchun Sun, Tingyang Xu, Yu Rong, and Junzhou Huang. Deep multimodal fusion by channel exchanging. Advances in neural information processing systems, 33:4835–4845, 2020

  80. [80]

    Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (SIGGRAPH Asia), 38(5), pages 146:1–146:12, 2019
