HOLA: Holistic Multi-Modal Alignment for Open-Set 3D Recognition

Ayellet Tal; Koby Aharonov; Oren Shrout

arxiv: 2606.01334 · v1 · pith:2IBY5H34new · submitted 2026-05-31 · 💻 cs.CV

HOLA: Holistic Multi-Modal Alignment for Open-Set 3D Recognition

Koby Aharonov , Oren Shrout , Ayellet Tal This is my paper

Pith reviewed 2026-06-28 17:23 UTC · model grok-4.3

classification 💻 cs.CV

keywords open-set 3D recognitionmulti-modal alignmentcontrastive losszero-shot learningpoint cloudopen-vocabulary 3Dmulti-positive contrastivetext adapter

0 comments

The pith

Aligning each 3D point cloud with multiple images and texts via a new contrastive loss yields better open-vocabulary recognition than single-alignment methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current 3D models are limited because they match each point cloud to only one image or caption, which gives a partial view of the object. It introduces a way to match each cloud to several images and several texts at once, using a loss that keeps the positives grouped without letting them crowd out hard negatives in the same softmax. A lightweight adapter also lets the model use raw web text more effectively. If this holds, models can recognize rare or unseen 3D categories more reliably in zero-shot settings while running at real-time speeds.

Core claim

The decoupled multi-positive contrastive loss enables joint alignment of a 3D instance with multiple matched multi-view images and multiple texts by separating positive aggregation from negative competition, which sharpens focus on challenging negatives and avoids spotlight crowding that occurs when many positives share the softmax with all negatives.

What carries the argument

The decoupled multi-positive contrastive loss, which aggregates positives separately before competing with negatives.

If this is right

State-of-the-art open-vocabulary performance on long-tail 3D benchmarks.
Substantial zero-shot gains while maintaining high frame rates.
Effective incorporation of large-scale unsupervised web captions through the text adapter.
Avoidance of partial-view anchoring that limits earlier distillation approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same loss structure could be tested on 2D or video data where multiple views are available but currently under-used.
If the number of positives grows too large, performance might eventually degrade; a controlled scaling experiment would reveal the practical limit.
The approach suggests that open-set 3D systems may benefit more from richer positive sets than from ever-larger negative banks.
Combining this alignment with geometric invariants already known in 3D vision could further reduce the need for any text supervision.

Load-bearing premise

That jointly aligning a 3D instance with multiple matched multi-view images and multiple texts captures a more holistic understanding than single-image or single-caption alignment.

What would settle it

An experiment showing that a model using only single-image or single-text alignment matches or exceeds the multi-alignment version on the same long-tail open-vocabulary benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.01334 by Ayellet Tal, Koby Aharonov, Oren Shrout.

**Figure 1.** Figure 1: (a) The plot shows that our models (in teal) achieve higher accuracy (vertical axis) and frame rates (horizontal axis) than competing methods, while using significantly fewer parameters (indicated by the smaller circle sizes). Results are shown for the challenging long-tail ObjaverseLVIS benchmark. (b) - (c) Given text or image queries, our 3D shape retrieval results indicate a well-structured embedding s… view at source ↗

**Figure 2.** Figure 2: Input Triplets. Each training triplet consists of a point cloud, a set of rendered images, and a set of texts. The texts may include (from top to bottom): (i) annotations, (ii) VLM-generated captions, and (iii) retrieved web captions. During training, the modalities within each triplet are used in turn as anchors: when one modality serves as the anchor, the remaining matched modalities from the same triple… view at source ↗

**Figure 3.** Figure 3: It comprises three encoders, one for each modality. The image and text encoders are kept [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 3.** Figure 3: Model. After each modality is encoded independently, pairwise losses are computed across all modality pairs. The decoupled multi-positive contrastive loss (shown in each rectangle) is one-to-many when involving point clouds and many-to-many when aligning images and texts. The image and text encoders are kept frozen, while the 3D encoder and the text adapter are trainable. Note that the colors (orange, cyan… view at source ↗

**Figure 4.** Figure 4: Effect of the “spotlight crowding” phenomenon. As views increase, the naive MP loss suppresses negative-sample gradients (dashed curves decline), leading to a larger drop in accuracy (solid curves). Results are reported on Objaverse-LVIS Top-1 accuracy for the "ShapeNet" (left) and "Ensembled no LVIS" (right) training sets. (4) and (5), shows the relationship between the gradients with respect to the posit… view at source ↗

**Figure 5.** Figure 5: Text-to-3D retrieval for common (head) categories. Each retrieved result is shown as a pair: an RGB rendering of the retrieved shape and its corresponding point cloud, which is the item returned by the search. The results indicate a well-structured embedding space across head categories. 4.2 Cross-Modal Retrieval Cross-modal retrieval [4, 23, 42] evaluates how faithfully our shared embedding space aligns m… view at source ↗

**Figure 6.** Figure 6: Text-to-3D retrieval for rare (tail) categories. Each retrieved result is shown as a pair: an RGB rendering of the retrieved shape and its corresponding point cloud, which is the item returned by the search. The results indicate a well-structured embedding space with robust performance on long-tail classes [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 7.** Figure 7: Image-to-3D retrieval for common (head) categories. Each retrieved result is shown as a pair: an RGB rendering of the retrieved shape and its corresponding point cloud, which is the item returned by the search. The results indicate a well-structured embedding space across head categories. Input 1st Retrieved Result 2nd Retrieved Result 3rd Retrieved Result [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

**Figure 8.** Figure 8: Image-to-3D retrieval for rare (tail) categories. Each retrieved result is shown as a pair: an RGB rendering of the retrieved shape and its corresponding point cloud, which is the item returned by the search. The results indicate a well-structured embedding space with robust performance on long-tail classes. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

**Figure 9.** Figure 9: Text-to-3D retrieval across closely related subcategories. Each retrieved result is shown as a pair: an RGB rendering of the retrieved shape and its corresponding point cloud, which is the item returned by the search. These examples demonstrate fine-grained separability among "helmet" subcategories and strong text-to-3D cross-modal consistency. Input 1st Retrieved Result 2nd Retrieved Result 3rd Retrieved … view at source ↗

**Figure 10.** Figure 10: Image-to-3D retrieval across closely related subcategories. Each retrieved result is shown as a pair: an RGB rendering of the retrieved shape and its corresponding point cloud, which is the item returned by the search. These examples demonstrate fine-grained separability among "clock" subcategories and strong image-to-3D cross-modal consistency. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

**Figure 11.** Figure 11: Style-aware cross-modal retrieval. Each retrieved result is shown as a pair: an RGB rendering of the retrieved shape and its corresponding point cloud, which is the item returned by the search. Top: style-mismatched text–image queries retrieve style-appropriate crocodile shapes, indicating the shared embedding space encodes semantic concepts across modalitites. Bottom: stylematched bonsai tree queries re… view at source ↗

read the original abstract

Open-set 3D recognition requires models that generalize to rare or unseen categories. Recent approaches address this by distilling language-vision knowledge into 3D encoders, typically relying on heavy 2D ViTs and aligning each point cloud with a single image or caption, thus anchoring representations to partial views. We propose aligning each point cloud with multiple images and textual descriptions to capture a more holistic understanding of 3D objects. To realize this idea, it is essential to design a loss function capable of jointly aligning a 3D instance with multiple matched signals, multi-view images and multiple texts, while separating positive aggregation from negative competition. We introduce such a function, termed the decoupled multi-positive contrastive loss. Our formulation enhances the loss's hardness-aware focus on challenging negatives, avoiding the "spotlight crowding" that occurs when many positives share the same softmax with all the negatives. Complementing this, we present a lightweight text adapter applied only to web captions, reducing the domain gap to curated annotations and enabling effective use of large-scale unsupervised text. Our model demonstrates state-of-the-art open-vocabulary performance on long-tail benchmarks, yielding substantial zero-shot improvements while sustaining high frame rates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HOLA's decoupled multi-positive contrastive loss targets a real issue in multi-view 3D alignment but the abstract gives no numbers or equations to check the gains.

read the letter

The main thing to know is that this paper proposes aligning each 3D point cloud with multiple images and multiple texts using a new decoupled multi-positive contrastive loss. The loss keeps positive aggregation separate from negative competition so that many positives do not crowd the softmax and weaken the hardness-aware signal on tough negatives. They also add a lightweight text adapter that only touches web captions to close the gap with curated text.

The paper does a clean job naming the limitation of current single-alignment distillation approaches that tie 3D encoders to partial views. The multi-signal idea is a logical step for objects that look different from different angles, and the adapter is a practical, low-cost way to bring in large-scale unsupervised text.

The soft spots are straightforward. The abstract states SOTA open-vocabulary results on long-tail benchmarks and substantial zero-shot gains at high frame rates, yet supplies no equations, no ablation tables, and no concrete deltas. Without those it is impossible to judge whether the decoupled loss actually delivers the claimed benefit or whether the multi-alignment overhead is justified. The central assumption that joint multi-view and multi-text alignment produces a more holistic representation also sits untested in the provided text.

This is for people working on open-set 3D recognition or multi-modal contrastive losses in computer vision. Someone already experimenting with multiple positives per anchor might pick up the decoupling trick. The work is coherent enough on its own terms to deserve a serious referee who can check the experiments and loss derivation.

Referee Report

0 major / 3 minor

Summary. The paper proposes HOLA for open-set 3D recognition, which aligns each point cloud instance with multiple matched multi-view images and multiple textual descriptions (rather than single-image or single-caption alignments) to capture holistic object understanding. It introduces a decoupled multi-positive contrastive loss that separates positive aggregation from negative competition to enhance hardness-aware focus and avoid spotlight crowding in the softmax, plus a lightweight text adapter applied only to web captions to reduce domain gap. The model is reported to achieve state-of-the-art open-vocabulary performance on long-tail benchmarks with substantial zero-shot gains while sustaining high frame rates.

Significance. If the central performance claims hold under full experimental validation, the work could meaningfully advance open-vocabulary 3D recognition by demonstrating that multi-positive multi-modal alignment yields better generalization to rare categories than prior single-alignment distillation approaches, with the efficiency of the text adapter offering practical value for scaling to unsupervised web data.

minor comments (3)

Abstract: the phrase 'spotlight crowding' is introduced without a brief definition or reference to prior contrastive-learning literature, which would clarify the motivation for the decoupled loss.
Abstract: 'long-tail benchmarks' and the specific zero-shot metrics (e.g., mAP or accuracy deltas) are not named, making it difficult for readers to immediately locate the supporting tables or compare against the cited baselines.
Abstract: the claim that the text adapter is 'lightweight' and 'applied only to web captions' would benefit from a short statement on parameter count or inference overhead relative to the 2D ViT backbone.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work and the recommendation for minor revision. We appreciate the acknowledgment that the multi-positive multi-modal alignment and decoupled contrastive loss could advance open-vocabulary 3D recognition, particularly for long-tail categories.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper proposes a new decoupled multi-positive contrastive loss and lightweight text adapter for multi-modal 3D alignment. No derivation chain, equations, or first-principles results are presented that reduce by construction to their own inputs. Claims rest on the empirical performance of the introduced method on benchmarks rather than any fitted parameter renamed as prediction, self-definitional construction, or load-bearing self-citation. The abstract and described content contain no self-referential steps matching the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5747 in / 1023 out tokens · 19132 ms · 2026-06-28T17:23:11.451039+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

55 extracted references · 6 canonical work pages · 4 internal anchors

[1]

Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding

Mohamed Afham, Isuru Dissanayake, Dinithi Dissanayake, Amaya Dharmasiri, Kanchana Thilakarathna, and Ranga Rodrigo. Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9902–9912, 2022

2022
[2]

ShapeNet: An Information-Rich 3D Model Repository

Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[3]

Pimae: Point cloud and image interactive masked autoencoders for 3d object detection

Anthony Chen, Kevin Zhang, Renrui Zhang, Zihan Wang, Yuheng Lu, Yandong Guo, and Shanghang Zhang. Pimae: Point cloud and image interactive masked autoencoders for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5291–5301, 2023

2023
[4]

Text2shape: Generating shapes from natural language by learning joint embeddings

Kevin Chen, Christopher B Choy, Manolis Savva, Angel X Chang, Thomas Funkhouser, and Silvio Savarese. Text2shape: Generating shapes from natural language by learning joint embeddings. InAsian conference on computer vision, pages 100–116. Springer, 2018

2018
[5]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InInternational conference on machine learning, pages 1597–1607. PmLR, 2020

2020
[6]

Reproducible scaling laws for contrastive language-image learning

Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2818–2829, 2023

2023
[7]

4d spatio-temporal convnets: Minkowski convolutional neural networks

Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3075–3084, 2019. 14

2019
[8]

Abo: Dataset and benchmarks for real-world 3d object understanding

Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F Yago Vicente, Thomas Dideriksen, Himanshu Arora, et al. Abo: Dataset and benchmarks for real-world 3d object understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21126–21136, 2022

2022
[9]

Objaverse: A universe of annotated 3d objects

Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13142–13153, 2023

2023
[10]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021

2021
[11]

3d-future: 3d furniture shape with texture.International Journal of Computer Vision, 129(12):3313–3337, 2021

Huan Fu, Rongfei Jia, Lin Gao, Mingming Gong, Binqiang Zhao, Steve Maybank, and Dacheng Tao. 3d-future: 3d furniture shape with texture.International Journal of Computer Vision, 129(12):3313–3337, 2021

2021
[12]

Sculpting holistic 3d representation in contrastive language-image-3d pre-training

Yipeng Gao, Zeyu Wang, Wei-Shi Zheng, Cihang Xie, and Yuyin Zhou. Sculpting holistic 3d representation in contrastive language-image-3d pre-training. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22998–23008, June 2024

2024
[13]

Lvis: A dataset for large vocabulary instance segmentation

Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019

2019
[14]

Momentum contrast for unsupervised visual representation learning

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020

2020
[15]

A comprehensive survey on contrastive learning.Neurocomputing, 610:128645, 2024

Haigen Hu, Xiaoyuan Wang, Yan Zhang, Qi Chen, and Qiu Guan. A comprehensive survey on contrastive learning.Neurocomputing, 610:128645, 2024

2024
[16]

Clip2point: Transfer clip to point cloud classification with image-depth pre-training

Tianyu Huang, Bowen Dong, Yunhan Yang, Xiaoshui Huang, Rynson WH Lau, Wanli Ouyang, and Wangmeng Zuo. Clip2point: Transfer clip to point cloud classification with image-depth pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22157–22167, 2023

2023
[17]

Hard negative mixing for contrastive learning.Advances in neural information processing systems, 33:21798–21809, 2020

Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, and Diane Larlus. Hard negative mixing for contrastive learning.Advances in neural information processing systems, 33:21798–21809, 2020

2020
[18]

Supervised contrastive learning.Advances in neural information processing systems, 33:18661–18673, 2020

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning.Advances in neural information processing systems, 33:18661–18673, 2020

2020
[19]

Medclip-samv2: Towards universal text-driven medical image segmentation.Medical Image Analysis, page 103749, 2025

Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, and Yiming Xiao. Medclip-samv2: Towards universal text-driven medical image segmentation.Medical Image Analysis, page 103749, 2025

2025
[20]

Duoduo clip: Efficient 3d understanding with multi-view images

Han-Hung Lee, Yiming Zhang, and Angel X Chang. Duoduo clip: Efficient 3d understanding with multi-view images. InInternational Conference on Learning Representations (ICLR), 2025

2025
[21]

Vit-lens: Towards omni-modal representations

Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, and Mike Zheng Shou. Vit-lens: Towards omni-modal representations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26647–26657, June 2024

2024
[22]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR, 2022

2022
[23]

Joint embeddings of shapes and images via cnn image purification.ACM transactions on graphics (TOG), 34 (6):1–12, 2015

Yangyan Li, Hao Su, Charles Ruizhongtai Qi, Noa Fish, Daniel Cohen-Or, and Leonidas J Guibas. Joint embeddings of shapes and images via cnn image purification.ACM transactions on graphics (TOG), 34 (6):1–12, 2015

2015
[24]

Openshape: Scaling up 3d shape representation towards open-world understanding

Minghua Liu, Ruoxi Shi, Kaiming Kuang, Yinhao Zhu, Xuanlin Li, Shizhong Han, Hong Cai, Fatih Porikli, and Hao Su. Openshape: Scaling up 3d shape representation towards open-world understanding. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 44860–44879. Cu...

2023
[25]

SGDR: Stochastic Gradient Descent with Warm Restarts

Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts.arXiv preprint arXiv:1608.03983, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[26]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[27]

Slip: Self-supervision meets language- image pre-training

Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. Slip: Self-supervision meets language- image pre-training. InEuropean conference on computer vision, pages 529–544. Springer, 2022

2022
[28]

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[29]

Pointnet: Deep learning on point sets for 3d classification and segmentation

Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017

2017
[30]

Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017

Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017

2017
[31]

Shapellm: Universal 3d object understanding for embodied interaction

Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, and Kaisheng Ma. Shapellm: Universal 3d object understanding for embodied interaction. InEuropean Conference on Computer Vision, pages 214–238. Springer, 2024

2024
[32]

Learn from zoom: Decoupled supervised contrastive learning for wce image classification

Kunpeng Qiu, Zhiying Zhou, and Yongxin Guo. Learn from zoom: Decoupled supervised contrastive learning for wce image classification. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2245–2249. IEEE, 2024

2024
[33]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021
[34]

arXiv preprint arXiv:2010.04592 , year=

Joshua Robinson, Ching-Yao Chuang, Suvrit Sra, and Stefanie Jegelka. Contrastive learning with hard negative samples.arXiv preprint arXiv:2010.04592, 2020

work page arXiv 2010
[35]

Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

2022
[36]

Sun rgb-d: A rgb-d scene understanding benchmark suite

Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 567–576, 2015

2015
[37]

Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results.Advances in neural information processing systems, 30, 2017

Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results.Advances in neural information processing systems, 30, 2017

2017
[38]

Stablerep: Synthetic images from text-to-image models make strong visual representation learners.Advances in Neural Information Processing Systems, 36:48382–48402, 2023

Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, and Dilip Krishnan. Stablerep: Synthetic images from text-to-image models make strong visual representation learners.Advances in Neural Information Processing Systems, 36:48382–48402, 2023

2023
[39]

Self-supervised representation learning with relative predictive coding.arXiv preprint arXiv:2103.11275, 2021

Yao-Hung Hubert Tsai, Martin Q Ma, Muqiao Yang, Han Zhao, Louis-Philippe Morency, and Ruslan Salakhutdinov. Self-supervised representation learning with relative predictive coding.arXiv preprint arXiv:2103.11275, 2021

work page arXiv 2021
[40]

Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data

Mikaela Angelina Uy, Quang-Hieu Pham, Binh-Son Hua, Thanh Nguyen, and Sai-Kit Yeung. Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1588–1597, 2019

2019
[41]

Understanding the behaviour of contrastive loss

Feng Wang and Huaping Liu. Understanding the behaviour of contrastive loss. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2495–2504, 2021

2021
[42]

Cross-modal retrieval: a systematic review of methods and future directions.Proceedings of the IEEE, 112(11):1716–1754, 2025

Tianshi Wang, Fengling Li, Lei Zhu, Jingjing Li, Zheng Zhang, and Heng Tao Shen. Cross-modal retrieval: a systematic review of methods and future directions.Proceedings of the IEEE, 112(11):1716–1754, 2025

2025
[43]

Dynamic graph cnn for learning on point clouds.ACM Transactions on Graphics (tog), 38(5):1–12, 2019

Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds.ACM Transactions on Graphics (tog), 38(5):1–12, 2019. 16

2019
[44]

Pointnet++ pytorch

Erik Wijmans. Pointnet++ pytorch. https: // github. com/ erikwijmans/ Pointnet2_ PyTorch, 2018

2018
[45]

3d shapenets: A deep representation for volumetric shapes

Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015

1912
[46]

Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding

Le Xue, Mingfei Gao, Chen Xing, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1179–1189, June 2023

2023
[47]

Ulip-2: Towards scalable multimodal pre-training for 3d understanding

Le Xue, Ning Yu, Shu Zhang, Artemis Panagopoulou, Junnan Li, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. Ulip-2: Towards scalable multimodal pre-training for 3d understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27091–27101, June 2024

2024
[48]

Decoupled contrastive learning

Chun-Hsiao Yeh, Cheng-Yao Hong, Yen-Chi Hsu, Tyng-Luh Liu, Yubei Chen, and Yann LeCun. Decoupled contrastive learning. InEuropean conference on computer vision, pages 668–684. Springer, 2022

2022
[49]

Mvimgnet: A large-scale dataset of multi-view images

Xianggang Yu, Mutian Xu, Yidan Zhang, Haolin Liu, Chongjie Ye, Yushuang Wu, Zizheng Yan, Chenming Zhu, Zhangyang Xiong, Tianyou Liang, et al. Mvimgnet: A large-scale dataset of multi-view images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9150–9161, 2023

2023
[50]

Point-bert: Pre-training 3d point cloud transformers with masked point modeling

Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19313–19322, 2022

2022
[51]

Clip2: Contrastive language-image-point pretraining from real-world point cloud data

Yihan Zeng, Chenhan Jiang, Jiageng Mao, Jianhua Han, Chaoqiang Ye, Qingqiu Huang, Dit-Yan Yeung, Zhen Yang, Xiaodan Liang, and Hang Xu. Clip2: Contrastive language-image-point pretraining from real-world point cloud data. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15244–15253, 2023

2023
[52]

Pointclip: Point cloud understanding by clip

Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8552–8562, June 2022

2022
[53]

Tamm: Triadapter multi-modal learning for 3d shape understanding

Zhihao Zhang, Shengcao Cao, and Yu-Xiong Wang. Tamm: Triadapter multi-modal learning for 3d shape understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21413–21423, June 2024

2024
[54]

Uni3d: Exploring unified 3d representation at scale

Junsheng Zhou, Jinsheng Wang, Baorui Ma, Yu-Shen Liu, Tiejun Huang, and Xinlong Wang. Uni3d: Exploring unified 3d representation at scale. InInternational Conference on Learning Representations (ICLR), 2024

2024
[55]

Ensembled

Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyu Guo, Ziyao Zeng, Zipeng Qin, Shanghang Zhang, and Peng Gao. Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2639–2650, October 2023. 17 Supplementary Material Supplementary material overview.Section ...

2023

[1] [1]

Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding

Mohamed Afham, Isuru Dissanayake, Dinithi Dissanayake, Amaya Dharmasiri, Kanchana Thilakarathna, and Ranga Rodrigo. Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9902–9912, 2022

2022

[2] [2]

ShapeNet: An Information-Rich 3D Model Repository

Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[3] [3]

Pimae: Point cloud and image interactive masked autoencoders for 3d object detection

Anthony Chen, Kevin Zhang, Renrui Zhang, Zihan Wang, Yuheng Lu, Yandong Guo, and Shanghang Zhang. Pimae: Point cloud and image interactive masked autoencoders for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5291–5301, 2023

2023

[4] [4]

Text2shape: Generating shapes from natural language by learning joint embeddings

Kevin Chen, Christopher B Choy, Manolis Savva, Angel X Chang, Thomas Funkhouser, and Silvio Savarese. Text2shape: Generating shapes from natural language by learning joint embeddings. InAsian conference on computer vision, pages 100–116. Springer, 2018

2018

[5] [5]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InInternational conference on machine learning, pages 1597–1607. PmLR, 2020

2020

[6] [6]

Reproducible scaling laws for contrastive language-image learning

Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2818–2829, 2023

2023

[7] [7]

4d spatio-temporal convnets: Minkowski convolutional neural networks

Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3075–3084, 2019. 14

2019

[8] [8]

Abo: Dataset and benchmarks for real-world 3d object understanding

Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F Yago Vicente, Thomas Dideriksen, Himanshu Arora, et al. Abo: Dataset and benchmarks for real-world 3d object understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21126–21136, 2022

2022

[9] [9]

Objaverse: A universe of annotated 3d objects

Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13142–13153, 2023

2023

[10] [10]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021

2021

[11] [11]

3d-future: 3d furniture shape with texture.International Journal of Computer Vision, 129(12):3313–3337, 2021

Huan Fu, Rongfei Jia, Lin Gao, Mingming Gong, Binqiang Zhao, Steve Maybank, and Dacheng Tao. 3d-future: 3d furniture shape with texture.International Journal of Computer Vision, 129(12):3313–3337, 2021

2021

[12] [12]

Sculpting holistic 3d representation in contrastive language-image-3d pre-training

Yipeng Gao, Zeyu Wang, Wei-Shi Zheng, Cihang Xie, and Yuyin Zhou. Sculpting holistic 3d representation in contrastive language-image-3d pre-training. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22998–23008, June 2024

2024

[13] [13]

Lvis: A dataset for large vocabulary instance segmentation

Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019

2019

[14] [14]

Momentum contrast for unsupervised visual representation learning

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020

2020

[15] [15]

A comprehensive survey on contrastive learning.Neurocomputing, 610:128645, 2024

Haigen Hu, Xiaoyuan Wang, Yan Zhang, Qi Chen, and Qiu Guan. A comprehensive survey on contrastive learning.Neurocomputing, 610:128645, 2024

2024

[16] [16]

Clip2point: Transfer clip to point cloud classification with image-depth pre-training

Tianyu Huang, Bowen Dong, Yunhan Yang, Xiaoshui Huang, Rynson WH Lau, Wanli Ouyang, and Wangmeng Zuo. Clip2point: Transfer clip to point cloud classification with image-depth pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22157–22167, 2023

2023

[17] [17]

Hard negative mixing for contrastive learning.Advances in neural information processing systems, 33:21798–21809, 2020

Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, and Diane Larlus. Hard negative mixing for contrastive learning.Advances in neural information processing systems, 33:21798–21809, 2020

2020

[18] [18]

Supervised contrastive learning.Advances in neural information processing systems, 33:18661–18673, 2020

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning.Advances in neural information processing systems, 33:18661–18673, 2020

2020

[19] [19]

Medclip-samv2: Towards universal text-driven medical image segmentation.Medical Image Analysis, page 103749, 2025

Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, and Yiming Xiao. Medclip-samv2: Towards universal text-driven medical image segmentation.Medical Image Analysis, page 103749, 2025

2025

[20] [20]

Duoduo clip: Efficient 3d understanding with multi-view images

Han-Hung Lee, Yiming Zhang, and Angel X Chang. Duoduo clip: Efficient 3d understanding with multi-view images. InInternational Conference on Learning Representations (ICLR), 2025

2025

[21] [21]

Vit-lens: Towards omni-modal representations

Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, and Mike Zheng Shou. Vit-lens: Towards omni-modal representations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26647–26657, June 2024

2024

[22] [22]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR, 2022

2022

[23] [23]

Joint embeddings of shapes and images via cnn image purification.ACM transactions on graphics (TOG), 34 (6):1–12, 2015

Yangyan Li, Hao Su, Charles Ruizhongtai Qi, Noa Fish, Daniel Cohen-Or, and Leonidas J Guibas. Joint embeddings of shapes and images via cnn image purification.ACM transactions on graphics (TOG), 34 (6):1–12, 2015

2015

[24] [24]

Openshape: Scaling up 3d shape representation towards open-world understanding

Minghua Liu, Ruoxi Shi, Kaiming Kuang, Yinhao Zhu, Xuanlin Li, Shizhong Han, Hong Cai, Fatih Porikli, and Hao Su. Openshape: Scaling up 3d shape representation towards open-world understanding. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 44860–44879. Cu...

2023

[25] [25]

SGDR: Stochastic Gradient Descent with Warm Restarts

Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts.arXiv preprint arXiv:1608.03983, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[26] [26]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[27] [27]

Slip: Self-supervision meets language- image pre-training

Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. Slip: Self-supervision meets language- image pre-training. InEuropean conference on computer vision, pages 529–544. Springer, 2022

2022

[28] [28]

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[29] [29]

Pointnet: Deep learning on point sets for 3d classification and segmentation

Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017

2017

[30] [30]

Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017

Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017

2017

[31] [31]

Shapellm: Universal 3d object understanding for embodied interaction

Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, and Kaisheng Ma. Shapellm: Universal 3d object understanding for embodied interaction. InEuropean Conference on Computer Vision, pages 214–238. Springer, 2024

2024

[32] [32]

Learn from zoom: Decoupled supervised contrastive learning for wce image classification

Kunpeng Qiu, Zhiying Zhou, and Yongxin Guo. Learn from zoom: Decoupled supervised contrastive learning for wce image classification. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2245–2249. IEEE, 2024

2024

[33] [33]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021

[34] [34]

arXiv preprint arXiv:2010.04592 , year=

Joshua Robinson, Ching-Yao Chuang, Suvrit Sra, and Stefanie Jegelka. Contrastive learning with hard negative samples.arXiv preprint arXiv:2010.04592, 2020

work page arXiv 2010

[35] [35]

Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

2022

[36] [36]

Sun rgb-d: A rgb-d scene understanding benchmark suite

Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 567–576, 2015

2015

[37] [37]

Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results.Advances in neural information processing systems, 30, 2017

Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results.Advances in neural information processing systems, 30, 2017

2017

[38] [38]

Stablerep: Synthetic images from text-to-image models make strong visual representation learners.Advances in Neural Information Processing Systems, 36:48382–48402, 2023

Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, and Dilip Krishnan. Stablerep: Synthetic images from text-to-image models make strong visual representation learners.Advances in Neural Information Processing Systems, 36:48382–48402, 2023

2023

[39] [39]

Self-supervised representation learning with relative predictive coding.arXiv preprint arXiv:2103.11275, 2021

Yao-Hung Hubert Tsai, Martin Q Ma, Muqiao Yang, Han Zhao, Louis-Philippe Morency, and Ruslan Salakhutdinov. Self-supervised representation learning with relative predictive coding.arXiv preprint arXiv:2103.11275, 2021

work page arXiv 2021

[40] [40]

Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data

Mikaela Angelina Uy, Quang-Hieu Pham, Binh-Son Hua, Thanh Nguyen, and Sai-Kit Yeung. Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1588–1597, 2019

2019

[41] [41]

Understanding the behaviour of contrastive loss

Feng Wang and Huaping Liu. Understanding the behaviour of contrastive loss. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2495–2504, 2021

2021

[42] [42]

Cross-modal retrieval: a systematic review of methods and future directions.Proceedings of the IEEE, 112(11):1716–1754, 2025

Tianshi Wang, Fengling Li, Lei Zhu, Jingjing Li, Zheng Zhang, and Heng Tao Shen. Cross-modal retrieval: a systematic review of methods and future directions.Proceedings of the IEEE, 112(11):1716–1754, 2025

2025

[43] [43]

Dynamic graph cnn for learning on point clouds.ACM Transactions on Graphics (tog), 38(5):1–12, 2019

Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds.ACM Transactions on Graphics (tog), 38(5):1–12, 2019. 16

2019

[44] [44]

Pointnet++ pytorch

Erik Wijmans. Pointnet++ pytorch. https: // github. com/ erikwijmans/ Pointnet2_ PyTorch, 2018

2018

[45] [45]

3d shapenets: A deep representation for volumetric shapes

Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015

1912

[46] [46]

Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding

Le Xue, Mingfei Gao, Chen Xing, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1179–1189, June 2023

2023

[47] [47]

Ulip-2: Towards scalable multimodal pre-training for 3d understanding

Le Xue, Ning Yu, Shu Zhang, Artemis Panagopoulou, Junnan Li, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. Ulip-2: Towards scalable multimodal pre-training for 3d understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27091–27101, June 2024

2024

[48] [48]

Decoupled contrastive learning

Chun-Hsiao Yeh, Cheng-Yao Hong, Yen-Chi Hsu, Tyng-Luh Liu, Yubei Chen, and Yann LeCun. Decoupled contrastive learning. InEuropean conference on computer vision, pages 668–684. Springer, 2022

2022

[49] [49]

Mvimgnet: A large-scale dataset of multi-view images

Xianggang Yu, Mutian Xu, Yidan Zhang, Haolin Liu, Chongjie Ye, Yushuang Wu, Zizheng Yan, Chenming Zhu, Zhangyang Xiong, Tianyou Liang, et al. Mvimgnet: A large-scale dataset of multi-view images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9150–9161, 2023

2023

[50] [50]

Point-bert: Pre-training 3d point cloud transformers with masked point modeling

Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19313–19322, 2022

2022

[51] [51]

Clip2: Contrastive language-image-point pretraining from real-world point cloud data

Yihan Zeng, Chenhan Jiang, Jiageng Mao, Jianhua Han, Chaoqiang Ye, Qingqiu Huang, Dit-Yan Yeung, Zhen Yang, Xiaodan Liang, and Hang Xu. Clip2: Contrastive language-image-point pretraining from real-world point cloud data. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15244–15253, 2023

2023

[52] [52]

Pointclip: Point cloud understanding by clip

Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8552–8562, June 2022

2022

[53] [53]

Tamm: Triadapter multi-modal learning for 3d shape understanding

Zhihao Zhang, Shengcao Cao, and Yu-Xiong Wang. Tamm: Triadapter multi-modal learning for 3d shape understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21413–21423, June 2024

2024

[54] [54]

Uni3d: Exploring unified 3d representation at scale

Junsheng Zhou, Jinsheng Wang, Baorui Ma, Yu-Shen Liu, Tiejun Huang, and Xinlong Wang. Uni3d: Exploring unified 3d representation at scale. InInternational Conference on Learning Representations (ICLR), 2024

2024

[55] [55]

Ensembled

Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyu Guo, Ziyao Zeng, Zipeng Qin, Shanghang Zhang, and Peng Gao. Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2639–2650, October 2023. 17 Supplementary Material Supplementary material overview.Section ...

2023