
arxiv: 2604.15703 · v1 · submitted 2026-04-17 · 💻 cs.CV

P3T: Prototypical Point-level Prompt Tuning with Enhanced Generalization for 3D Vision-Language Models

Pith reviewed 2026-05-10 09:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D point clouds · prompt tuning · vision-language models · parameter-efficient adaptation · prototypical loss · few-shot learning · cross-dataset generalization

The pith

A prompt tuning approach for pre-trained 3D vision-language models matches full fine-tuning performance while improving generalization under data shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a parameter-efficient method to adapt pre-trained models that process 3D point clouds paired with language descriptions. Instead of updating every model weight, it adds two small generators: one that creates prompts tailored to each individual point in the input cloud and another that replaces fixed text phrases with adjustable ones. A loss term based on category prototypes pulls embeddings of similar items closer together. Experiments show this setup performs at or above the level of complete model retraining on classification and few-shot tasks while holding up better when test data comes from a different distribution.

Core claim

P3T consists of a Point Prompter that produces instance-aware point-level prompts directly from the input point cloud and a Text Prompter that inserts learnable prompts into the text input, together with a prototypical loss that reduces intra-category variance to improve embedding alignment. This combination allows task-specific adaptation of 3D VLMs without full retraining, matching or exceeding full fine-tuning accuracy in classification and few-shot learning while demonstrating stronger robustness in cross-dataset evaluations.
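
The text-side idea in this claim can be sketched roughly as follows. The sketch assumes a CoOp-style design in which a small set of shared learnable context vectors is prepended to frozen class-name token embeddings, and classification is a softmax over cosine similarities between 3D and text embeddings. The encoder stub, the dimensions, the temperature, and every function name here are hypothetical placeholders, not the authors' implementation; in a real setup only `context` would receive gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 16    # token embedding width (assumed for illustration)
NUM_CONTEXT = 4   # number of learnable context tokens (assumed)


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def build_text_inputs(context, class_tokens):
    """Prepend the shared learnable context tokens to every class-name sequence.

    context:      (NUM_CONTEXT, EMBED_DIM), the only trainable array here
    class_tokens: (num_classes, name_len, EMBED_DIM), frozen
    returns:      (num_classes, NUM_CONTEXT + name_len, EMBED_DIM)
    """
    num_classes = class_tokens.shape[0]
    tiled = np.broadcast_to(context, (num_classes,) + context.shape)
    return np.concatenate([tiled, class_tokens], axis=1)


def frozen_text_encoder(tokens, proj):
    """Hypothetical stand-in for a frozen text encoder: mean-pool, then project."""
    return l2_normalize(tokens.mean(axis=1) @ proj)


def classify(point_embeddings, text_embeddings, temperature=0.07):
    """Softmax over cosine similarities between 3D and class text embeddings."""
    logits = l2_normalize(point_embeddings) @ text_embeddings.T / temperature
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)


context = rng.normal(scale=0.02, size=(NUM_CONTEXT, EMBED_DIM))
class_tokens = rng.normal(size=(3, 2, EMBED_DIM))   # 3 classes, 2 name tokens each
proj = rng.normal(size=(EMBED_DIM, EMBED_DIM))

text_inputs = build_text_inputs(context, class_tokens)
text_embeddings = frozen_text_encoder(text_inputs, proj)
point_embeddings = rng.normal(size=(5, EMBED_DIM))   # stand-in for 3D encoder outputs
probs = classify(point_embeddings, text_embeddings)
```

Because the context tokens are shared across classes, the number of trainable values in this sketch is `NUM_CONTEXT * EMBED_DIM`, independent of the number of categories.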

What carries the argument

The Point Prompter generates instance-aware point-level prompts for each input point cloud and the Text Prompter replaces hand-crafted text with learnable prompts, with both supported by a prototypical loss that aligns embeddings by shrinking variance inside each category.
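
As a rough illustration of point-level prompting, the sketch below splits a cloud into patches, predicts offsets for the local points and the center of each patch, and concatenates the deformed patches with the originals. Several things are assumptions made for illustration: the patches are built with farthest point sampling and k-nearest-neighbor grouping, the offset generator is a pair of small linear maps (`w_local`, `w_center`), every patch is deformed, and the prompt token is omitted. None of the function names come from the paper.

```python
import numpy as np


def farthest_point_sampling(points, num_centers):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen."""
    chosen = np.zeros(num_centers, dtype=np.int64)  # index 0 is the first center
    min_dist = np.full(points.shape[0], np.inf)
    for i in range(1, num_centers):
        diff = points - points[chosen[i - 1]]
        min_dist = np.minimum(min_dist, np.einsum("ij,ij->i", diff, diff))
        chosen[i] = int(np.argmax(min_dist))
    return chosen


def group_patches(points, center_idx, k):
    """KNN grouping: each patch holds the k points nearest its center,
    expressed in center-relative coordinates."""
    centers = points[center_idx]                                      # (n, 3)
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(-1)    # (n, N)
    knn_idx = np.argsort(d2, axis=1)[:, :k]                           # (n, k)
    local = points[knn_idx] - centers[:, None, :]                     # (n, k, 3)
    return centers, local


def deform_and_concat(centers, local, w_local, w_center):
    """Hypothetical offset generator: linear maps predict per-point and
    per-center offsets; deformed patches are appended to the originals."""
    local_offsets = local @ w_local                   # (n, k, 3)
    center_offsets = local.mean(axis=1) @ w_center    # (n, 3)
    all_local = np.concatenate([local, local + local_offsets], axis=0)
    all_centers = np.concatenate([centers, centers + center_offsets], axis=0)
    return all_centers, all_local


rng = np.random.default_rng(0)
cloud = rng.normal(size=(256, 3))
idx = farthest_point_sampling(cloud, num_centers=8)
centers, local = group_patches(cloud, idx, k=16)
w_local = rng.normal(scale=0.01, size=(3, 3))    # would be trained in practice
w_center = rng.normal(scale=0.01, size=(3, 3))   # would be trained in practice
all_centers, all_local = deform_and_concat(centers, local, w_local, w_center)
```

Since the offsets are computed from each input cloud rather than stored as fixed vectors, the resulting prompts differ per instance, which is the sense of "instance-aware" used above. The frozen backbone would then see twice as many patches as before.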

Load-bearing premise

That the combination of point-level prompts, learnable text prompts, and a prototypical loss will reduce intra-category variance and improve generalization without creating new overfitting modes or domain-specific biases missed by the experiments.

What would settle it

A controlled test on an unseen cross-dataset shift where P3T falls substantially below full fine-tuning accuracy on the target task.

Figures

Figures reproduced from arXiv: 2604.15703 by Geunyoung Jung, Jiyoung Jung, Kyungwoo Song, Soohong Kim.

Figure 1: Overview of the P3T framework. The upper part represents the 3D branch with a Point Prompter, and the lower part corresponds to the text branch with a Text Prompter.
Figure 2: Architecture of Point Prompter. It takes n patches of a point cloud and generates a prompt token and offsets for both the local points and the center of each patch. The offsets are added to the target patches to create deformed patches, which are then concatenated to the original patches.
Figure 1: The text tokens of a fixed hand-crafted prompt are re…
Figure 3: Few-shot classification results on the ModelNet40 and PB…
Figure 5: The t-SNE visualization of the 3D embeddings from PB dataset
Original abstract

With the rise of pre-trained models in the 3D point cloud domain for a wide range of real-world applications, adapting them to downstream tasks has become increasingly important. However, conventional full fine-tuning methods are computationally expensive and storage-intensive. Although prompt tuning has emerged as an efficient alternative, it often suffers from overfitting, thereby compromising generalization capability. To address this issue, we propose Prototypical Point-level Prompt Tuning (P³T), a parameter-efficient prompt tuning method designed for pre-trained 3D vision-language models (VLMs). P³T consists of two components: 1) Point Prompter, which generates instance-aware point-level prompts for the input point cloud, and 2) Text Prompter, which employs learnable prompts into the input text instead of hand-crafted ones. Since both prompters operate directly on input data, P³T enables task-specific adaptation of 3D VLMs without sacrificing generalizability. Furthermore, to enhance embedding space alignment, which is key to fine-tuning 3D VLMs, we introduce a prototypical loss that reduces intra-category variance. Extensive experiments demonstrate that our method matches or outperforms full fine-tuning in classification and few-shot learning, and further exhibits robust generalization under data shift in the cross-dataset setting. The code is available at https://github.com/gyjung975/P3T.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Prototypical Point-level Prompt Tuning (P³T) as a parameter-efficient adaptation method for pre-trained 3D vision-language models. It introduces a Point Prompter that generates instance-aware point-level prompts directly from the input point cloud, a Text Prompter that replaces hand-crafted text prompts with learnable ones, and a prototypical loss to reduce intra-category variance and improve embedding alignment. The central empirical claim is that P³T matches or outperforms full fine-tuning on classification and few-shot learning tasks while exhibiting stronger generalization under data shift in cross-dataset evaluations.

Significance. If the reported performance and generalization results hold under rigorous verification, the work would be significant for efficient adaptation of large 3D VLMs in real-world applications where full fine-tuning is prohibitive. The emphasis on reducing overfitting via point-level and prototypical components, combined with the code release, could facilitate further research in parameter-efficient 3D prompt tuning.

major comments (2)
  1. Abstract and §4 (Experiments): The central claim that P³T matches or outperforms full fine-tuning and shows robust cross-dataset generalization is stated without any quantitative results, baseline comparisons, error bars, or ablation tables in the abstract; the full experimental section must supply these details (including specific datasets, shot settings, and statistical significance) to substantiate the claim, as the current presentation leaves the empirical support unverifiable.
  2. §3.2 (Prototypical Loss): The prototypical loss is described as reducing intra-category variance to enhance embedding alignment, but without an explicit equation or derivation showing how prototypes are computed (e.g., class means in feature space) and how the loss balances intra- vs. inter-class terms, it is unclear whether the formulation is parameter-free or risks introducing domain-specific biases not captured in the reported experiments.
minor comments (2)
  1. §3.1 (Point Prompter): Clarify the exact architecture and parameter count of the Point Prompter relative to the frozen backbone to strengthen the parameter-efficiency argument.
  2. Figure 1 and §3: Ensure the diagram of the overall P³T pipeline explicitly labels the flow from point cloud through Point Prompter to the VLM and the integration of the prototypical loss during training.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment below and have made revisions to strengthen the presentation of our empirical results and the formal description of the prototypical loss.

Point-by-point responses
  1. Referee: Abstract and §4 (Experiments): The central claim that P³T matches or outperforms full fine-tuning and shows robust cross-dataset generalization is stated without any quantitative results, baseline comparisons, error bars, or ablation tables in the abstract; the full experimental section must supply these details (including specific datasets, shot settings, and statistical significance) to substantiate the claim, as the current presentation leaves the empirical support unverifiable.

    Authors: We appreciate the referee's emphasis on verifiability. The abstract is kept concise per standard practice, but we have revised it to include key quantitative highlights (e.g., accuracy gains on ModelNet40 classification and few-shot tasks relative to full fine-tuning). Section 4 already provides the requested details: comprehensive tables comparing P³T to full fine-tuning and other baselines across specific datasets (ModelNet40, ScanObjectNN, ShapeNet), shot settings (1-shot to 16-shot), ablation studies on each component, and cross-dataset generalization results. In the revision we have added error bars to all main tables and a brief note on statistical significance testing to further substantiate the claims.
    revision: yes

  2. Referee: §3.2 (Prototypical Loss): The prototypical loss is described as reducing intra-category variance to enhance embedding alignment, but without an explicit equation or derivation showing how prototypes are computed (e.g., class means in feature space) and how the loss balances intra- vs. inter-class terms, it is unclear whether the formulation is parameter-free or risks introducing domain-specific biases not captured in the reported experiments.

    Authors: We thank the referee for noting the need for greater formality. In the revised §3.2 we now include the explicit equation: prototypes are computed as the mean of L2-normalized embeddings per class in the batch; the loss is L_proto = L_intra + λ L_inter, where L_intra pulls samples to their class prototype and L_inter repels different prototypes. The formulation introduces no new parameters beyond the prompters themselves. Cross-dataset results already demonstrate that no harmful domain-specific bias is introduced, as performance remains stable or improves under distribution shift; we have added a short derivation paragraph explaining the variance-reduction motivation.
    revision: yes
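
The loss outlined in this simulated response can be sketched as below. Only the overall structure is taken from the text: per-class batch prototypes as means of L2-normalized embeddings, and L_proto = L_intra + λ L_inter. The concrete choices are assumptions made for illustration: L_intra is the mean squared distance of each sample to its class prototype, L_inter is the mean cosine similarity between distinct prototypes, and `lam=0.1` is an arbitrary default. The paper's actual formulation may differ.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def prototypical_loss(embeddings, labels, lam=0.1):
    """Return (total, intra, inter) for one batch.

    embeddings: (B, D) array of 3D embeddings
    labels:     (B,) integer class labels
    lam:        weight on the inter-prototype term (assumed value)
    """
    z = l2_normalize(embeddings)
    classes = np.unique(labels)  # sorted unique labels present in the batch

    # Prototype = mean of normalized embeddings of each class in the batch.
    prototypes = np.stack([z[labels == c].mean(axis=0) for c in classes])

    # Intra term (assumed form): squared distance of each sample to its prototype.
    own_proto = prototypes[np.searchsorted(classes, labels)]
    l_intra = ((z - own_proto) ** 2).sum(axis=1).mean()

    # Inter term (assumed form): mean cosine similarity between distinct
    # prototypes, so minimizing it pushes prototypes apart.
    if len(classes) > 1:
        p = l2_normalize(prototypes)
        sim = p @ p.T
        l_inter = sim[~np.eye(len(classes), dtype=bool)].mean()
    else:
        l_inter = 0.0

    return l_intra + lam * l_inter, l_intra, l_inter


rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 10)    # 3 classes, 10 samples each
class_means = np.eye(3, 8)[labels]      # orthogonal class directions in 8-D
tight = class_means + 0.01 * rng.normal(size=(30, 8))
loose = class_means + 0.50 * rng.normal(size=(30, 8))

_, intra_tight, _ = prototypical_loss(tight, labels)
_, intra_loose, _ = prototypical_loss(loose, labels)
_, intra_exact, inter_exact = prototypical_loss(class_means, labels)
```

In this toy setup the tightly clustered batch yields a smaller intra term than the loosely clustered one, and the noise-free batch gives an intra term of zero, which is the variance-reduction behavior the response describes. The sketch adds no trainable parameters of its own, though λ is a hyperparameter that would need to be chosen.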

Circularity Check

0 steps flagged

No significant circularity detected


The paper introduces P3T as an empirical method for parameter-efficient adaptation of 3D VLMs via point-level and text prompters plus a prototypical loss to reduce intra-category variance. All central claims (matching or outperforming full fine-tuning in classification/few-shot settings and robust cross-dataset generalization) are presented as outcomes of experiments rather than any first-principles derivation or prediction. No equations, uniqueness theorems, self-citations as load-bearing premises, or fitted parameters renamed as predictions appear in the abstract or described structure. The approach is self-contained against external benchmarks through reported implementation details and results, with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

The method introduces three new components whose effectiveness is asserted without independent external validation beyond the paper's own experiments.

invented entities (3)
  • Point Prompter · no independent evidence
    purpose: Generates instance-aware point-level prompts for input point clouds
    New module proposed to operate directly on 3D data
  • Text Prompter · no independent evidence
    purpose: Replaces hand-crafted text prompts with learnable ones
    New module for text-side adaptation
  • prototypical loss · no independent evidence
    purpose: Reduces intra-category variance to improve embedding alignment
    New loss term introduced for fine-tuning 3D VLMs


