Rethinking Point Clouds as Sequences: A Causal Next-Token Predictive Learning Framework

Haowen Gu; Jingzhi Dong; Tao Chen; Xiaoshui Huang; Yazhou Yao; Yumeng Yao; Zonghan Wu

arxiv: 2605.17566 · v1 · pith:PRHEKDRMnew · submitted 2026-05-17 · 💻 cs.CV

Yumeng Yao, Jingzhi Dong, Haowen Gu, Tao Chen, Zonghan Wu, Xiaoshui Huang, Yazhou Yao

Pith reviewed 2026-05-20 13:15 UTC · model grok-4.3

classification 💻 cs.CV

keywords: point cloud pre-training, self-supervised learning, next-token prediction, causal transformer, 3D point clouds, serialization, latent prediction

The pith

Point clouds can be effectively pre-trained by causal next-token prediction on geometry-serialized patch sequences without reconstruction decoders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that point cloud self-supervised learning can be reformulated as a causal next-token prediction task in latent space. It does so by first dividing a point cloud into local patches and ordering them into a sequence based on the positions of their centers. A causal Transformer is then trained to predict subsequent tokens using only the preceding context, with the objective stabilized by stop-gradient mechanisms. This decoder-free design learns 3D structural dependencies directly. A sympathetic reader would care because it aligns 3D pre-training with the successful predictive learning paradigm from language models, offering a simpler and potentially more scalable approach compared to methods that rely on masked reconstruction or explicit generation.

Core claim

The core discovery is the PointNTP framework, which models point clouds as sequences for fully causal, decoder-free latent next-token prediction. Patches are serialized according to patch-center geometry, and a causal Transformer under prefix-only conditioning is trained with a shift-based prediction objective. This setup enables the model to capture structural dependencies in latent space without any reconstruction components, leading to strong performance on downstream 3D tasks.
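The shift-based objective can be sketched in a few lines. This is a minimal illustration, assuming the cosine-similarity form of the loss quoted elsewhere on this page (one minus the mean cosine similarity between the prediction at position t and the target at position t+1). The names `cosine` and `next_token_loss` are hypothetical, plain Python lists stand in for latent vectors, and the stop-gradient appears only as a comment because pure Python has no autograd.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def next_token_loss(predictions, targets):
    """Shift-based loss: the prediction at position t is scored against target t+1.

    predictions[t] is assumed to be computed from tokens 0..t only.
    targets[t] is the latent token at position t; in a real training loop it
    would be detached (stop-gradient) so that it acts as a constant.
    """
    T = len(targets)
    if T < 2 or len(predictions) != T:
        raise ValueError("need matching sequences of length >= 2")
    # The last prediction has no successor, so only T-1 pairs are scored.
    sims = [cosine(predictions[t], targets[t + 1]) for t in range(T - 1)]
    return 1.0 - sum(sims) / (T - 1)

targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aligned = [[0.0, 2.0], [3.0, 3.0], [9.0, 9.0]]     # points at the next target
opposed = [[0.0, -1.0], [-1.0, -1.0], [9.0, 9.0]]  # points away from it
loss_aligned = next_token_loss(aligned, targets)
loss_opposed = next_token_loss(opposed, targets)
```

Because the loss depends only on direction, predictions that are scaled copies of the next target give a loss near 0, while predictions pointing the opposite way give a loss near 2.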

What carries the argument

The key machinery is the geometry-based serialization of point patches into a token sequence combined with prefix-only causal Transformer modeling for next-token prediction in latent space.
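To make the serialization step concrete, here is a minimal sketch that orders patch centers along a Morton (Z-order) curve. It is a stand-in, not the paper's procedure: this page only states that patches are ordered by patch-center geometry, and a quoted passage elsewhere on the page mentions Hilbert-style scans. Morton order is used here only because it is shorter to write. The names `morton_code` and `serialize_patches`, the 10-bit grid, and the assumption that centers are normalized to the unit cube are all illustrative.

```python
def morton_code(ix, iy, iz, bits=10):
    """Interleave the bits of three grid indices into one Z-order key."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code

def serialize_patches(centers, bits=10):
    """Return patch indices sorted by the Morton key of their centers.

    centers: list of (x, y, z) tuples, assumed to lie in the unit cube [0, 1].
    """
    scale = (1 << bits) - 1  # largest grid index per axis

    def key(i):
        x, y, z = centers[i]
        return morton_code(int(x * scale), int(y * scale), int(z * scale), bits)

    return sorted(range(len(centers)), key=key)

centers = [(0.9, 0.9, 0.9), (0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
order = serialize_patches(centers)
```

The returned index list defines the token order; the patch embeddings would be gathered in that order before being fed to the causal Transformer.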

If this is right

It achieves 93.8% accuracy on OBJ_BG, 92.6% on OBJ_ONLY, and 89.3% on PB_T50_RS of ScanObjectNN.
It reaches 85.0% Cls.mIoU on ShapeNetPart segmentation.
It obtains 71.1% mAcc on S3DIS Area 5 semantic segmentation.
The approach demonstrates that predictive dependency modeling can serve as an alternative to input recovery in point cloud pre-training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method might generalize to other modalities if a suitable serialization strategy is developed for them.
Alternative patch ordering criteria could be tested to see if they improve capture of 3D structure.
Joint pre-training with other data types like images or text could become feasible under this unified causal prediction framework.

Load-bearing premise

The assumption that ordering patches by the geometry of their centers creates a sequence in which causal dependencies reflect the key 3D structural information required for good performance on downstream tasks.

What would settle it

If randomizing the patch order or using a non-causal bidirectional model produces comparable downstream performance, this would indicate that the specific causal serialization and prediction setup is not essential to the results.
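The contrast being tested is between a prefix-only (causal) attention mask and a bidirectional one. The following is a minimal sketch with hypothetical helper names, where `mask[i][j]` is `True` when token `i` may attend to token `j`.

```python
def causal_mask(n):
    """Prefix-only mask: token i may attend to tokens 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Non-causal baseline: every token may attend to every token."""
    return [[True] * n for _ in range(n)]

def visible_context(mask, i):
    """Indices of the tokens that position i is allowed to see."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]

causal_view = visible_context(causal_mask(4), 2)
full_view = visible_context(bidirectional_mask(4), 2)
```

Under the causal mask, position 2 sees only positions 0 to 2, so the patch order determines which context each prediction can use. Under the bidirectional mask the order matters far less, which is why the two ablations named above probe the same assumption from different sides.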

Figures

Figures reproduced from arXiv: 2605.17566 by Haowen Gu, Jingzhi Dong, Tao Chen, Xiaoshui Huang, Yazhou Yao, Yumeng Yao, Zonghan Wu.

**Figure 2.** Overall architecture of PointNTP. PointNTP reformulates point-cloud self-supervised pre-training as a fully causal, …

Original abstract

With the rapid progress of multimodal foundation models and predictive pre-training, an important open question is how to equip 3D point clouds with a pre-training paradigm that is better aligned with next-token and next-embedding learning. Existing point-cloud self-supervised methods are largely built on masked reconstruction or explicit geometric generation, and thus remain tied to input recovery rather than predictive dependency modeling. In this paper, we introduce PointNTP, which reformulates point cloud pre-training as a fully causal, decoder-free latent Next-Token Prediction problem. Specifically, each point cloud is first partitioned into local patches and serialized into a structured 3D token sequence according to patch-center geometry. The resulting sequence is then modeled by a causal Transformer under prefix-only conditioning, and trained with a shift-based prediction objective stabilized by stop-gradient targets. This design enables the model to learn structural dependencies directly in latent space, without reconstruction decoders or explicit geometric recovery. Extensive experiments demonstrate that the proposed PointNTP is highly competitive across multiple downstream tasks: it achieves 93.8%(+0.5%), 92.6%(+0.3%), and 89.3%(+1.1%) on OBJ_BG, OBJ_ONLY, and PB_T50_RS of ScanObjectNN, respectively; obtains 85.0%(+0.1%) in Cls.mIoU on ShapeNetPart; and reaches 71.1% mAcc on S3DIS Area 5. Overall, decoder-free causal latent prediction provides a simple, scalable, and potentially modality-agnostic paradigm for point-cloud self-supervised learning, offering a new 3D perspective on foundation-style predictive learning for 3D data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PointNTP, a self-supervised pre-training framework for 3D point clouds that recasts the task as decoder-free causal next-token prediction in latent space. Each point cloud is partitioned into local patches, serialized into a token sequence ordered by patch-center geometry, and modeled by a causal Transformer under prefix-only conditioning with a shift-based prediction loss stabilized by stop-gradient targets. The method is evaluated on classification (ScanObjectNN), part segmentation (ShapeNetPart), and semantic segmentation (S3DIS), reporting competitive accuracies such as 93.8% on OBJ_BG and 71.1% mAcc on S3DIS Area 5.

Significance. If the results are robust, the work supplies a simple, scalable alternative to masked-reconstruction or explicit geometric-generation approaches, aligning 3D self-supervision with the next-token predictive paradigm used in language and 2D foundation models. The decoder-free design and the claim of potential modality-agnostic applicability are clear strengths that could influence future 3D foundation-model research.

major comments (2)

[§3.2] Serialization procedure. The central claim that ordering patches by the geometry of their centers yields sequences whose causal dependencies capture essential 3D structure is load-bearing for the assertion of a 'principled causal paradigm.' No ablation compares alternative orderings (lexicographic, Morton, surface-based, etc.), nor is there analysis showing that prefix tokens provide geometrically meaningful context for later tokens. Standard center-based sorts can place structurally adjacent patches far apart in the linear order, weakening the justification that prefix-only conditioning learns useful 3D dependencies rather than arbitrary sequence statistics.
[§4] Experiments. The reported gains (e.g., +0.5% on OBJ_BG, +1.1% on PB_T50_RS) are presented without error bars, standard deviations over random seeds, or verification across multiple data splits. This absence makes it impossible to determine whether the improvements are statistically reliable or sensitive to implementation choices, directly affecting confidence in the empirical support for the proposed paradigm.
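The serialization concern in the first major comment can be probed with a simple locality measure: the mean Euclidean distance between consecutive patch centers in the serialized order. The sketch below is an illustrative diagnostic, not an analysis from the paper. `mean_step_length` and `lexicographic_order` are hypothetical names, and the four-corner layout is a toy input chosen so the arithmetic is easy to follow.

```python
import math

def mean_step_length(centers, order):
    """Average Euclidean distance between consecutive patches in `order`."""
    if len(order) < 2:
        return 0.0
    steps = [math.dist(centers[a], centers[b]) for a, b in zip(order, order[1:])]
    return sum(steps) / len(steps)

def lexicographic_order(centers):
    """Sort patch indices by (x, y, z), a simple center-based ordering."""
    return sorted(range(len(centers)), key=lambda i: centers[i])

# Toy layout: the four corners of a unit square in the z = 0 plane.
centers = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
lex_steps = mean_step_length(centers, lexicographic_order(centers))
loop_steps = mean_step_length(centers, [0, 1, 3, 2])  # walk around the square
```

On this toy input the lexicographic sort takes one diagonal jump, so its mean step is longer than that of the walk around the square. A larger mean step means neighbouring tokens in the sequence are, on average, farther apart in 3D.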

minor comments (2)

[§3.3] The stop-gradient target mechanism is described only in prose; a short pseudocode block or diagram in §3.3 would improve reproducibility.
[§3.1] Notation for the latent token sequence and the shift-based objective could be introduced with an explicit equation rather than inline text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment below and describe the revisions we will incorporate to improve the manuscript.

Referee: [§3.2] Serialization procedure. The central claim that ordering patches by the geometry of their centers yields sequences whose causal dependencies capture essential 3D structure is load-bearing for the assertion of a 'principled causal paradigm.' No ablation compares alternative orderings (lexicographic, Morton, surface-based, etc.), nor is there analysis showing that prefix tokens provide geometrically meaningful context for later tokens. Standard center-based sorts can place structurally adjacent patches far apart in the linear order, weakening the justification that prefix-only conditioning learns useful 3D dependencies rather than arbitrary sequence statistics.

Authors: We appreciate the referee's point that empirical support for the geometric serialization is important to substantiate the causal paradigm. While the ordering is chosen to respect spatial proximity in 3D, we agree that direct comparisons are needed. In the revised manuscript we will add an ablation study comparing our patch-center geometric ordering to lexicographic, Morton, and random orderings on the ScanObjectNN classification task. We will also include a brief analysis (e.g., attention-map inspection and a simple prefix-to-token geometric correlation metric) to illustrate that earlier tokens in the sequence provide spatially relevant context for later tokens. These additions will strengthen the justification for the chosen serialization.

Revision: yes

Referee: [§4] Experiments. The reported gains (e.g., +0.5% on OBJ_BG, +1.1% on PB_T50_RS) are presented without error bars, standard deviations over random seeds, or verification across multiple data splits. This absence makes it impossible to determine whether the improvements are statistically reliable or sensitive to implementation choices, directly affecting confidence in the empirical support for the proposed paradigm.

Authors: We agree that reporting variability is essential for assessing the reliability of the results. In the revised version we will rerun the main experiments on ScanObjectNN, ShapeNetPart, and S3DIS using at least five different random seeds and report mean accuracy together with standard deviation. We will also evaluate performance on an additional data split of ScanObjectNN where feasible. These changes will allow readers to better judge the statistical robustness of the reported gains.

Revision: yes
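Reporting across seeds amounts to a mean and a sample standard deviation per benchmark. Here is a minimal sketch using Python's `statistics` module. The accuracy values are placeholders invented for illustration and are not results from the paper; `summarize_runs` is a hypothetical helper.

```python
import statistics

def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) over per-seed accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a spread")
    return statistics.mean(accuracies), statistics.stdev(accuracies)

# Placeholder numbers for five seeds; NOT taken from the paper.
placeholder_runs = [90.0, 91.0, 89.0, 90.5, 89.5]
mean_acc, std_acc = summarize_runs(placeholder_runs)
print(f"{mean_acc:.2f} ± {std_acc:.2f}")
```

A reported gain that is smaller than the seed-to-seed standard deviation should be read with caution, which is the substance of the referee's request.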

Circularity Check

0 steps flagged

No significant circularity: new serialization + standard causal objective applied to point clouds

The paper constructs a pipeline by partitioning point clouds into patches, ordering them via patch-center geometry to form a token sequence, and then training a causal Transformer with a standard shift-based next-token prediction loss in latent space. This is an application of an existing decoder-free causal objective (not derived from the paper's fitted values or self-defined) to a newly proposed 3D serialization. No load-bearing step reduces by construction to its own inputs, self-citation chains, or renamed known results; the central claim that this learns useful structural dependencies is supported by downstream task results rather than tautological equivalence. The geometric ordering choice is an explicit modeling assumption, not a circular definition.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The framework relies on standard Transformer causal masking and point-cloud patch extraction; no new physical constants or entities are introduced. The main design choice is the serialization order, which functions as an implicit modeling assumption rather than a fitted parameter.

free parameters (1)

patch partitioning granularity
Number and size of local patches determine sequence length and are chosen by the authors to balance context and compute.

axioms (1)

domain assumption: Ordering patches by center coordinates produces a sequence whose causal statistics reflect 3D geometric structure
Invoked when the paper states that the serialized sequence enables the model to learn structural dependencies directly.

pith-pipeline@v0.9.0 · 5856 in / 1278 out tokens · 37737 ms · 2026-05-20T13:15:02.274528+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: "each point cloud is first partitioned into local patches and serialized into a structured 3D token sequence according to patch-center geometry... Hilbert-style scans... $\operatorname{argsort}(H_o(c_1), \dots, H_o(c_G))$"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: $\mathcal{L}_{\text{next}} = 1 - \frac{1}{T-1} \sum_{t} \left\langle \frac{\hat{z}_t}{\lVert \hat{z}_t \rVert}, \frac{\mathrm{sg}(z_{t+1})}{\lVert z_{t+1} \rVert} \right\rangle$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Mohamed Afham, Isuru Dissanayake, Dinithi Dissanayake, Amaya Dharmasiri, Kanchana Thilakarathna, and Ranga Rodrigo. 2022. Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9902–9912

work page 2022

[2] [2]

Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 2016. 3d semantic parsing of large-scale indoor spaces. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 1534–1543

work page 2016

[3] [3]

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. 2023. Self-supervised learning from images with a joint-embedding predictive architecture. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 15619–15629

work page 2023

[4] [4]

George Bredis, Nikita Balagansky, Daniil Gavrilov, and Ruslan Rakhimov. 2026. Next Embedding Prediction Makes World Models Stronger.arXiv preprint arXiv:2603.02765(2026)

work page arXiv 2026

[5] [5]

Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qix- ing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. 2015. Shapenet: An information-rich 3d model repository.arXiv preprint arXiv:1512.03012(2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015

[6] [6]

Guangyan Chen, Meiling Wang, Yi Yang, Kai Yu, Li Yuan, and Yufeng Yue. 2023. Pointgpt: Auto-regressively generative pre-training from point clouds.Advances in Neural Information Processing Systems36 (2023), 29667–29679

work page 2023

[7] [7]

Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. 2020. Generative Pretraining From Pixels. InProceedings of the 37th International Conference on Machine Learning. 1691–1703

work page 2020

[8] [8]

Runnan Chen, Youquan Liu, Lingdong Kong, Xinge Zhu, Yuexin Ma, Yikang Li, Yuenan Hou, Yu Qiao, and Wenping Wang. 2023. Clip2scene: Towards label- efficient 3d scene understanding by clip. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7020–7030

work page 2023

[9] [9]

Xinlei Chen and Kaiming He. 2021. Exploring simple siamese representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 15750–15758

work page 2021

[10] [10]

Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. 2017. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE conference on computer vision and pattern recognition. 5828–5839

[11]

Runpei Dong, Zekun Qi, Linfeng Zhang, Junbo Zhang, Jianjian Sun, Zheng Ge, Li Yi, and Kaisheng Ma. 2023. Autoencoders as cross-modal teachers: Can pretrained 2d image transformers help 3d representation learning? In The Eleventh International Conference on Learning Representations.

[12]

Yulan Guo, Hanyun Wang, Qingyong Hu, Hao Liu, Li Liu, and Mohammed Bennamoun. 2020. Deep learning for 3d point clouds: A survey. IEEE transactions on pattern analysis and machine intelligence 43, 12 (2020), 4338–4364.

[13]

Ziyu Guo, Renrui Zhang, Longtian Qiu, Xianzhi Li, and Pheng-Ann Heng. 2023. Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence. 791–799.

[14]

Abdullah Hamdi, Silvio Giancola, and Bernard Ghanem. 2021. Mvtn: Multi-view transformation network for 3d shape recognition. In Proceedings of the IEEE/CVF international conference on computer vision. 1–11.

[15]

Cheng-Yao Hong, Yu-Ying Chou, and Tyng-Luh Liu. 2023. Attention discriminant sampling for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 14429–14440.

[16]

Di Huang, Sida Peng, Tong He, Honghui Yang, Xiaowei Zhou, and Wanli Ouyang. 2023. Ponder: Point cloud pre-training via neural rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 16089–16098.

[18]

Longlong Jing, Yucheng Chen, Ling Zhang, Mingyi He, and Yingli Tian. 2020. Self-supervised modal and view invariant feature learning. arXiv preprint arXiv:2005.14169 (2020).

[19]

Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. 2018. Pointcnn: Convolution on x-transformed points. Advances in neural information processing systems 31 (2018).

[20]

Zhe Li, Zhangyang Gao, Cheng Tan, Bocheng Ren, Laurence T Yang, and Stan Z Li. 2024. General point model pretraining with autoencoding and autoregressive. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 20954–20964.

[21]

Dingkang Liang, Xin Zhou, Wei Xu, Xingkui Zhu, Zhikang Zou, Xiaoqing Ye, Xiao Tan, and Xiang Bai. 2024. Pointmamba: A simple state space model for point cloud analysis. Advances in neural information processing systems 37 (2024), 32653–32677.

[22]

Haotian Liu, Mu Cai, and Yong Jae Lee. 2022. Masked discrimination for self-supervised learning on point clouds. In European Conference on Computer Vision. Springer, 657–675.

[23]

Xu Ma, Can Qin, Haoxuan You, Haoxi Ran, and Yun Fu. 2022. Rethinking network design and local geometry in point cloud: A simple residual MLP framework. In International Conference on Learning Representations.

[24]

Yatian Pang, Wenxiao Wang, Francis E. H. Tay, Wei Liu, Yonghong Tian, and Li Yuan. 2022. Masked Autoencoders for Point Cloud Self-supervised Learning. In European Conference on Computer Vision. Springer, 604–621.

[25]

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computat...

[26]

Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 652–660.

[27]

Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. 2017. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems 30 (2017).

[28]

Zekun Qi, Runpei Dong, Guofan Fan, Zheng Ge, Xiangyu Zhang, Kaisheng Ma, and Li Yi. 2023. Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining. In Proceedings of the 40th International Conference on Machine Learning. 28223–28243.

[29]

Guocheng Qian, Yuchen Li, Houwen Peng, Jinjie Mai, Hasan Hammoud, Mohamed Elhoseiny, and Bernard Ghanem. 2022. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. Advances in neural information processing systems 35 (2022), 23192–23204.

[30]

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).

[31]

Haoxi Ran, Jun Liu, and Chengjie Wang. 2022. Surface representation for point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 18942–18952.

[32]

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 568 (2024), 127063.

[33]

Mikaela Angelina Uy, Quang-Hieu Pham, Binh-Son Hua, Thanh Nguyen, and Sai-Kit Yeung. 2019. Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data. In Proceedings of the IEEE/CVF international conference on computer vision. 1588–1597.

[34]

Hanchen Wang, Qi Liu, Xiangyu Yue, Joan Lasenby, and Matt J Kusner. 2021. Unsupervised point cloud pre-training via occlusion completion. In Proceedings of the IEEE/CVF international conference on computer vision. 9782–9792.

[35]

Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. 2019. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (tog) 38, 5 (2019), 1–12.

[36]

Ziyi Wang, Yongming Rao, Xumin Yu, Jie Zhou, and Jiwen Lu. 2024. Point-to-pixel prompting for point cloud analysis with pre-trained image models. IEEE Transactions on Pattern Analysis and Machine Intelligence 46, 6 (2024), 4381–4397.

[37]

Ziyi Wang, Xumin Yu, Yongming Rao, Jie Zhou, and Jiwen Lu. 2022. P2p: Tuning pre-trained image models for point cloud analysis with point-to-pixel prompting. Advances in neural information processing systems 35 (2022), 14388–14402.

[38]

Ziyi Wang, Xumin Yu, Yongming Rao, Jie Zhou, and Jiwen Lu. 2023. Take-a-photo: 3d-to-2d generative pre-training of point cloud models. In Proceedings of the IEEE/CVF international conference on computer vision. 5640–5650.

[39]

Ziyi Wang, Yanran Zhang, Jie Zhou, and Jiwen Lu. 2025. Unipre3d: Unified pre-training of 3d point cloud models with cross-modal gaussian splatting. In Proceedings of the Computer Vision and Pattern Recognition Conference. 1319–1329.

[40]

Saining Xie, Jiatao Gu, Demi Guo, Charles R Qi, Leonidas Guibas, and Or Litany. 2020. Pointcontrast: Unsupervised pre-training for 3d point cloud understanding. In European conference on computer vision. Springer, 574–591.

[42]

Sihan Xu, Ziqiao Ma, Wenhao Chai, Xuweiyi Chen, Weiyang Jin, Joyce Chai, Saining Xie, and Stella X Yu. 2025. Next-Embedding Prediction Makes Strong Vision Learners. arXiv preprint arXiv:2512.16922 (2025).

[43]

Le Xue, Mingfei Gao, Chen Xing, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. 2023. Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 1179–1189.

[44]

Siming Yan, Zhenpei Yang, Haoxiang Li, Chen Song, Li Guan, Hao Kang, Gang Hua, and Qixing Huang. 2023. Implicit autoencoder for point-cloud self-supervised representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 14530–14542.

[45]

Li Yi, Vladimir G Kim, Duygu Ceylan, I-Chao Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang, Alla Sheffer, and Leonidas Guibas. 2016. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (ToG) 35, 6 (2016), 1–12.

[46]

Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. 2022. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 19313–19322.

[48]

Yaohua Zha, Tao Dai, Hang Guo, Yanzi Wang, Bin Chen, Ke Chen, and Shu-Tao Xia. 2025. Point Cloud Mixture-of-Domain-Experts Model for 3D Self-supervised Learning. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence. International Joint Conferences on Artificial Intelligence Organization, 2332–2340.

[49]

Yaohua Zha, Huizhen Ji, Jinmin Li, Rongsheng Li, Tao Dai, Bin Chen, Zhi Wang, and Shu-Tao Xia. 2024. Towards compact 3d representations via point feature enhancement masked autoencoders. In Proceedings of the AAAI conference on artificial intelligence, Vol. 38. 6962–6970.

[50]

Yaohua Zha, Jinpeng Wang, Tao Dai, Bin Chen, Zhi Wang, and Shu-Tao Xia. 2023. Instance-aware dynamic prompt tuning for pre-trained point cloud models. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 14161–14170.

[51]

Renrui Zhang, Ziyu Guo, Peng Gao, Rongyao Fang, Bin Zhao, Dong Wang, Yu Qiao, and Hongsheng Li. 2022. Point-m2ae: multi-scale masked autoencoders for hierarchical point cloud pre-training. Advances in neural information processing systems 35 (2022), 27061–27074.

[52]

Renrui Zhang, Liuhui Wang, Yu Qiao, Peng Gao, and Hongsheng Li. 2023. Learning 3d representations from 2d pre-trained models via image-to-point masked autoencoders. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 21769–21780.

[53]

Xiangdong Zhang, Shaofeng Zhang, and Junchi Yan. 2024. Pcp-mae: Learning to predict centers for point masked autoencoders. Advances in Neural Information Processing Systems 37 (2024), 80303–80327.

[54]

Xiangdong Zhang, Shaofeng Zhang, and Junchi Yan. 2025. Towards More Diverse and Challenging Pre-training for Point Cloud Learning: Self-Supervised Cross Reconstruction with Decoupled Views. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 28696–28706.

[55]

Zaiwei Zhang, Rohit Girdhar, Armand Joulin, and Ishan Misra. 2021. Self-supervised pretraining of 3d features on any point-cloud. In Proceedings of the IEEE/CVF international conference on computer vision. 10252–10263.

[56]

Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. 2021. Point transformer. In Proceedings of the IEEE/CVF international conference on computer vision. 16259–16268.

[57]

Xiao Zheng, Xiaoshui Huang, Guofeng Mei, Yuenan Hou, Zhaoyang Lyu, Bo Dai, Wanli Ouyang, and Yongshun Gong. 2024. Point cloud pre-training with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 22935–22945.

[58]

Xin Zhou, Dingkang Liang, Wei Xu, Xingkui Zhu, Yihan Xu, Zhikang Zou, and Xiang Bai. 2024. Dynamic adapter meets prompt tuning: Parameter-efficient transfer learning for point cloud analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14707–14717.
