pith. machine review for the scientific record.

arxiv: 2604.19432 · v1 · submitted 2026-04-21 · 💻 cs.CV

Recognition: unknown

DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords: open-set 3D object retrieval · DINO encoder · CLIP adaptation · virtual feature synthesis · multi-view integration · 3DOR benchmarks · Chunking and Adapting Module

The pith

DEC adapts DINO with CLIP-synthesized virtual features to retrieve unseen 3D objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that a DINO encoder can be adapted for open-set 3D object retrieval without overfitting to known classes by adding dynamic multi-view processing and regularization from synthesized unseen data. It observes that frozen DINO with simple mean-pooling already works reasonably well, yet standard fine-tuning collapses to average known-class patterns. The proposed solution uses the Chunking and Adapting Module to break views into chunks and integrate local relations, plus the Virtual Feature Synthesis module that draws on CLIP's aligned space to create proxy features for missing classes. A sympathetic reader would care because real retrieval systems routinely encounter objects outside any fixed training set, and this method offers a route to better generalization without exhaustive new labeling.
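The contrast between the two aggregation strategies is easiest to see in code. Below is a minimal, hypothetical PyTorch sketch: mean-pooling over frozen-DINO view features versus a chunked, attention-style integration in the spirit of CAM. The feature dimension, view count, chunk size, and the attention design are all illustrative assumptions; the paper's actual CAM architecture is not specified in this extract.

```python
# Sketch of the two aggregation strategies the pith contrasts: mean-pooling
# over frozen-DINO view features vs. a chunked, attention-style integration.
# All shapes and the chunk-attention design are assumptions, not the paper's CAM.
import torch
import torch.nn as nn

D, N_VIEWS, CHUNK = 768, 12, 3  # hypothetical feature dim, view count, chunk size

def mean_pool(view_feats: torch.Tensor) -> torch.Tensor:
    """Baseline: average the N view features into one 3D descriptor."""
    return view_feats.mean(dim=0)

class ChunkedIntegrator(nn.Module):
    """Toy stand-in for CAM: split views into chunks, let views inside a
    chunk attend to each other, then pool the chunk summaries."""
    def __init__(self, dim: int, chunk: int):
        super().__init__()
        self.chunk = chunk
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        chunks = view_feats.split(self.chunk, dim=0)           # local groups of views
        summaries = []
        for c in chunks:
            c = c.unsqueeze(0)                                 # [1, chunk, D]
            out, _ = self.attn(c, c, c)                        # integrate local relations
            summaries.append(self.norm(out + c).mean(dim=1))   # residual + pool
        return torch.cat(summaries, dim=0).mean(dim=0)         # global descriptor

views = torch.randn(N_VIEWS, D)   # stand-in for frozen-DINO view features
baseline = mean_pool(views)
adapted = ChunkedIntegrator(D, CHUNK)(views)
print(baseline.shape, adapted.shape)  # both torch.Size([768])
```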

Core claim

We propose DINO Eats CLIP (DEC), a novel framework for dynamic multi-view integration that is regularized by synthesizing data for unseen classes. We first find that simply mean-pooling over view features from a frozen DINO backbone gives decent performance. Yet, further adaptation causes severe overfitting on average view patterns of known classes. To combat this, we then design a module named the Chunking and Adapting Module (CAM). It segments multi-view images into chunks and dynamically integrates local view relations, yielding more robust features than the standard pooling strategy. Finally, we propose a Virtual Feature Synthesis (VFS) module to mitigate bias towards known categories explicitly.

What carries the argument

The Chunking and Adapting Module (CAM) together with the Virtual Feature Synthesis (VFS) module, where CAM segments multi-view images for dynamic local integration and VFS uses CLIP's pre-aligned space to generate virtual features that regularize training against known-class bias.
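As an illustration of the VFS idea, the sketch below treats CLIP text embeddings of unseen class names as anchors in the pre-aligned space and samples virtual features around them. The Gaussian-perturbation scheme, embedding dimension, and noise scale are assumptions for exposition, not the paper's stated mechanism.

```python
# Hedged sketch of a VFS-style regularizer: unseen-class text anchors in a
# CLIP-like unit-sphere space, with virtual features sampled nearby.
import torch
import torch.nn.functional as F

D_CLIP = 512  # hypothetical CLIP embedding dim
# Stand-in for CLIP text embeddings of prompts like "a photo of [class]"
# for 20 unseen labels; real usage would call a CLIP text encoder instead.
unseen_text = F.normalize(torch.randn(20, D_CLIP), dim=-1)

def synthesize_virtual(anchors: torch.Tensor, per_class: int = 8,
                       sigma: float = 0.05) -> torch.Tensor:
    """Sample virtual features near each unseen-class anchor and renormalize
    onto the unit sphere, mimicking CLIP's cosine-similarity geometry."""
    noise = sigma * torch.randn(anchors.size(0), per_class, anchors.size(1))
    virtual = anchors.unsqueeze(1) + noise            # [C, per_class, D]
    return F.normalize(virtual, dim=-1).reshape(-1, anchors.size(1))

virtual_feats = synthesize_virtual(unseen_text)
print(virtual_feats.shape)  # torch.Size([160, 512])
```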

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If VFS succeeds, pre-aligned vision-language spaces could serve as a general source of regularization for any 3D model facing unknown categories.
  • The same chunking-plus-synthesis pattern might transfer to point-cloud or voxel inputs once a suitable backbone replaces the multi-view DINO stage.
  • One could test whether other self-supervised encoders paired with the same VFS step produce comparable gains on the same open-set 3DOR benchmarks.

Load-bearing premise

Synthesizing virtual features for unseen classes via CLIP's pre-aligned space will reliably improve discrimination of real unseen objects without introducing new biases or distribution shifts.

What would settle it

A controlled experiment that adds or removes the VFS module and measures whether retrieval accuracy on a set of real unseen 3D objects rises or falls.
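Such an ablation reduces to comparing retrieval mAP between two embedding sets. Below is a self-contained sketch of the metric side, with random placeholder embeddings standing in for the VFS-trained and VFS-ablated models; only the mAP computation is concrete.

```python
# Minimal harness for the proposed ablation: same unseen queries and gallery,
# two model variants, compare cosine-ranking mean average precision.
import torch

def retrieval_map(q: torch.Tensor, g: torch.Tensor,
                  q_lbl: torch.Tensor, g_lbl: torch.Tensor) -> float:
    """Mean average precision over cosine-similarity rankings."""
    q = torch.nn.functional.normalize(q, dim=-1)
    g = torch.nn.functional.normalize(g, dim=-1)
    sims = q @ g.T                                   # [Q, G]
    aps = []
    for i in range(q.size(0)):
        order = sims[i].argsort(descending=True)
        rel = (g_lbl[order] == q_lbl[i]).float()     # 1 where class matches
        if rel.sum() == 0:
            continue
        prec = rel.cumsum(0) / torch.arange(1, rel.numel() + 1)
        aps.append((prec * rel).sum() / rel.sum())   # average precision
    return torch.stack(aps).mean().item()

# Stand-in embeddings; in practice these come from the two model variants.
Q, G, D = 50, 500, 768
q_lbl, g_lbl = torch.randint(0, 10, (Q,)), torch.randint(0, 10, (G,))
map_with_vfs = retrieval_map(torch.randn(Q, D), torch.randn(G, D), q_lbl, g_lbl)
map_without = retrieval_map(torch.randn(Q, D), torch.randn(G, D), q_lbl, g_lbl)
print(f"mAP with VFS: {map_with_vfs:.3f}  without: {map_without:.3f}")
```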

Figures

Figures reproduced from arXiv: 2604.19432 by Jingbo Xia, Jinhai Xiang, Qianru Han, Xiang Bai, Xinwei He, Yang Zhou, Yansong Zheng, Yulong Wang, Yuxuan Cai, Zhichuan Wang.

Figure 1: (a) Prior CLIP-based methods rely on multi-modal …
Figure 2: Overview of our framework. During training, it processes known-category images by frozen DINO and CLIP image encoders, while encoding known and unseen (e.g., ImageNet) class labels via CLIP's text encoder using the prompt "a photo of [class]". Based on them, we synthesize virtual features to train our chunking and adapting (CAM) adapter jointly with an end-to-end metric learning loss, encouraging it to pro…
Figure 4: Illustration of our virtual feature synthesis module. The final, enriched concept space is defined as the union of the original and new labels: Y = Y_seen ∪ Y_new, which provides a scalable foundation for incorporating a much broader universe of semantic concepts. Aligned embeddings extraction: for each view image I_m^i of a 3D training object o^i, we utilize the CLIP visual encoder to map it into the aligned vis…
Figure 6: Impact of training sample numbers per class on OS-MN40-core. Accompanying chunk-size ablation:

chunk size   mAP↑    NDCG↑   ANMRR↓
1            63.62   75.38   38.77
3            67.62   77.67   34.95
5            64.91   75.68   37.08
7            65.72   76.42   36.87
Figure 7: Retrieval example comparisons with other methods on OS-MN40-core. Incorrect matches are in red boxes. 5. Conclusions. In this paper, we presented a new framework, DINO Eats CLIP (DEC), to adapt multi-view images for open-set 3DOR. Building upon the robust self-supervised view representations, we designed a novel chunking and adapting module to effectively integrate rich local relations across views into c…
Figure 8: Effect of view numbers (N) on OS-MN40-core. …for task-specific feature adjustment. In such settings, the adapter contributes more complementary information, and the model benefits from assigning a larger relative weight to the adapted features.

Dataset       Backbone   λ      mAP↑    NDCG↑   ANMRR↓
OS-ESB-core   ViT-B/14   0.11   61.82   24.55   42.74
OS-ESB-core   ViT-L/14   0.11   60.59   24.46   43.30
OS-NTU-core   ViT-B/14   0.11   61.56   27.26   41…
Figure 9: More retrieval examples on OS-MN40-core. As shown, DEC faithfully retrieves relevant 3D objects for 3D query objects of common classes such as bathtub, door, and bed. However, certain challenging cases (rows 4–6) exist for classes such as stool, bowl, and bottle, leading to failures. For instance, in row 5, a bowl query is incorrectly matched with instances from the vase class, despite the subtle diff…
read the original abstract

Vision foundation models have shown great promise for open-set 3D object retrieval (3DOR) through efficient adaptation to multi-view images. Leveraging a semantically aligned latent space, previous work typically adapts the CLIP encoder to build view-based 3D descriptors. Despite CLIP's strong generalization ability, its lack of fine-grained discrimination prompted us to explore the potential of a more recent self-supervised encoder, DINO. To this end, we propose DINO Eats CLIP (DEC), a novel framework for dynamic multi-view integration that is regularized by synthesizing data for unseen classes. We first find that simply mean-pooling over view features from a frozen DINO backbone gives decent performance. Yet, further adaptation causes severe overfitting on average view patterns of known classes. To combat this, we then design a module named the Chunking and Adapting Module (CAM). It segments multi-view images into chunks and dynamically integrates local view relations, yielding more robust features than the standard pooling strategy. Finally, we propose a Virtual Feature Synthesis (VFS) module to mitigate bias towards known categories explicitly. Under the hood, VFS leverages CLIP's broad, pre-aligned vision-language space to synthesize virtual features for unseen classes. By exposing DEC to these virtual features, we greatly enhance its open-set discrimination capacity. Extensive experiments on standard open-set 3DOR benchmarks demonstrate its superior efficacy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes DINO Eats CLIP (DEC), a framework for open-set 3D object retrieval that freezes a DINO backbone, applies mean-pooling over multi-view features as a baseline, introduces a Chunking and Adapting Module (CAM) to segment views and dynamically integrate local relations to reduce overfitting on known-class patterns, and adds a Virtual Feature Synthesis (VFS) module that leverages CLIP's pre-aligned vision-language space to generate virtual features for unseen classes, thereby mitigating known-class bias and improving open-set discrimination.

Significance. If the central claims hold after addressing the integration details, the work would be significant for showing a practical way to adapt recent self-supervised encoders like DINO to open-set 3D retrieval tasks by combining dynamic multi-view pooling with cross-model virtual data synthesis from CLIP, potentially offering better fine-grained generalization than direct CLIP adaptation while keeping the backbone frozen.

major comments (2)
  1. [VFS module] VFS module (described after CAM): the manuscript states that virtual features synthesized in CLIP space are used to 'expose DEC' to unseen classes, but provides no description of a projection, adapter, or alignment loss that would map these features into the support of the frozen DINO feature distribution; without such a mechanism the claimed regularization effect against known-class bias cannot be guaranteed and any benchmark gains could be artifacts of CAM alone.
  2. [Experiments] Experimental claims (abstract and results section): the text asserts 'extensive experiments on standard open-set 3DOR benchmarks demonstrate its superior efficacy' yet the provided manuscript contains no quantitative numbers, ablation tables isolating CAM versus VFS, or analysis of distribution shift between real DINO features and CLIP-synthesized virtual features, making it impossible to verify that VFS is load-bearing for the open-set gains.
minor comments (1)
  1. [Abstract] Abstract: the sentence 'we first find that simply mean-pooling... gives decent performance' would benefit from a brief parenthetical reference to the specific benchmark and metric where this baseline was measured.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and outline the revisions planned for the manuscript.

read point-by-point responses
  1. Referee: [VFS module] VFS module (described after CAM): the manuscript states that virtual features synthesized in CLIP space are used to 'expose DEC' to unseen classes, but provides no description of a projection, adapter, or alignment loss that would map these features into the support of the frozen DINO feature distribution; without such a mechanism the claimed regularization effect against known-class bias cannot be guaranteed and any benchmark gains could be artifacts of CAM alone.

    Authors: We agree that the VFS module description in the current manuscript lacks explicit details on the mapping mechanism. In the revised version, we will add a dedicated subsection describing the projection (a trainable linear layer followed by normalization) and the alignment loss (a combination of MSE and contrastive terms) that maps CLIP-synthesized virtual features into the distribution of the frozen DINO features; a minimal sketch of such a mapping appears after this list. This addition will clarify how VFS provides regularization against known-class bias beyond the effects of CAM. revision: yes

  2. Referee: [Experiments] Experimental claims (abstract and results section): the text asserts 'extensive experiments on standard open-set 3DOR benchmarks demonstrate its superior efficacy' yet the provided manuscript contains no quantitative numbers, ablation tables isolating CAM versus VFS, or analysis of distribution shift between real DINO features and CLIP-synthesized virtual features, making it impossible to verify that VFS is load-bearing for the open-set gains.

    Authors: We acknowledge that the submitted manuscript draft omits the quantitative results, ablation tables, and distribution-shift analysis. The revised manuscript will include complete experimental sections with benchmark performance numbers, ablations that separately disable CAM and VFS, and supporting analyses (e.g., feature-space distance metrics and visualizations) demonstrating the contribution of VFS to the open-set gains; a sketch of such distribution statistics also follows this list. revision: yes
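To make the rebuttal's first response concrete: a minimal sketch, assuming a trainable linear projection with normalization and an MSE-plus-InfoNCE alignment loss, of how CLIP-space virtual features could be mapped into the frozen-DINO feature space. The dimensions, temperature, and loss weights are hypothetical, not the manuscript's values.

```python
# Sketch of the promised mapping: linear projection + normalization from
# CLIP space into DINO space, trained with MSE plus a contrastive term.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_CLIP, D_DINO, TAU = 512, 768, 0.07  # hypothetical dims and temperature

proj = nn.Sequential(nn.Linear(D_CLIP, D_DINO), nn.LayerNorm(D_DINO))

def alignment_loss(clip_feats, dino_feats, mse_weight=1.0, nce_weight=1.0):
    """clip_feats: [B, D_CLIP] paired with dino_feats: [B, D_DINO]."""
    mapped = proj(clip_feats)
    mse = F.mse_loss(mapped, dino_feats)             # match the target features
    a = F.normalize(mapped, dim=-1)
    b = F.normalize(dino_feats, dim=-1)
    logits = a @ b.T / TAU                           # pairwise similarities
    targets = torch.arange(a.size(0))                # the i-th pair is positive
    nce = F.cross_entropy(logits, targets)           # InfoNCE-style term
    return mse_weight * mse + nce_weight * nce

loss = alignment_loss(torch.randn(32, D_CLIP), torch.randn(32, D_DINO))
loss.backward()  # gradients flow only into the projection parameters
```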
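And for the second response, one plausible form the promised distribution-shift analysis could take: simple feature-space statistics between real DINO features and projected virtual features. The metric choices (centroid gap, mean cross-set cosine) are illustrative assumptions, not the analyses the authors committed to.

```python
# Simple distribution-shift statistics between two feature sets that are
# supposed to live in the same space.
import torch
import torch.nn.functional as F

def shift_stats(real: torch.Tensor, virtual: torch.Tensor) -> dict:
    """Both inputs: [N, D] feature matrices in the same space."""
    centroid_gap = (real.mean(0) - virtual.mean(0)).norm().item()
    cos = F.normalize(real, dim=-1) @ F.normalize(virtual, dim=-1).T
    return {"centroid_gap": centroid_gap,
            "mean_cross_cosine": cos.mean().item()}

stats = shift_stats(torch.randn(200, 768), torch.randn(200, 768))
print(stats)
```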

Circularity Check

0 steps flagged

No circularity; claims rest on external pretrained models and empirical validation

full rationale

The paper's derivation introduces DEC with two new modules (CAM for dynamic view integration and VFS for synthesizing virtual features from CLIP's vision-language space) applied to a frozen DINO backbone. VFS explicitly depends on an external pretrained CLIP model rather than any quantity fitted or defined within the current work. No equations, self-citations, or uniqueness theorems are presented that would reduce the open-set gains to a tautology or to parameters of the present model. Performance claims are grounded in experiments on standard benchmarks, anchoring the chain in external data rather than in the model's own constructs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on two assumptions taken without further validation: that DINO provides finer-grained features than CLIP, and that CLIP's vision-language space can be used to synthesize useful virtual features for unseen classes.

axioms (2)
  • domain assumption DINO features are sufficiently fine-grained for 3D view aggregation
    Stated as motivation for switching from CLIP; no proof or citation of prior verification for 3DOR.
  • domain assumption CLIP's pre-aligned space can generate virtual features that improve open-set discrimination
    Core of the VFS module; treated as given.

pith-pipeline@v0.9.0 · 5580 in / 1429 out tokens · 30587 ms · 2026-05-10T02:51:24.119895+00:00 · methodology

