PASR: Pose-Aware 3D Shape Retrieval from Occluded Single Views
Pith reviewed 2026-05-08 12:33 UTC · model grok-4.3
The pith
A pose-aware analysis-by-synthesis approach retrieves 3D shapes from single occluded images by distilling 2D features into a 3D encoder and optimizing over shape and pose at test time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PASR formulates retrieval as a feature-level analysis-by-synthesis problem by distilling DINOv3 features into a 3D encoder, aligning pose-conditioned projections of 3D shapes with 2D feature maps during training. At inference, a test-time optimization searches over shape identity and pose parameters for the combination that best reconstructs the patch-level feature map of the input image. The paper reports substantial gains on both clean and occluded retrieval benchmarks, together with competitive pose estimation and category classification.
What carries the argument
Pose-conditioned projection of distilled 3D features into 2D space, followed by test-time analysis-by-synthesis optimization that jointly varies shape identity and pose to match the input feature map.
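The search described above can be sketched as a loop over candidate shapes and poses. This is a minimal, gradient-free sketch under stated assumptions: `shape_bank`, `render_fn`, and `pose_grid` are hypothetical names, not the paper's API, and the actual method presumably optimizes pose continuously with gradients rather than over a discrete grid.

```python
import numpy as np

def feature_reconstruction_error(f_img, f_rendered):
    """Mean squared error between two (num_patches, dim) feature maps."""
    return float(np.mean((f_img - f_rendered) ** 2))

def retrieve_shape_and_pose(f_img, shape_bank, render_fn, pose_grid):
    """Analysis-by-synthesis retrieval sketch: for every candidate shape and
    every pose in a coarse grid, render pose-conditioned features and keep the
    (shape, pose) pair whose rendering best reconstructs the image features."""
    best = (None, None, np.inf)
    for shape_id, shape_feats in shape_bank.items():
        for pose in pose_grid:
            err = feature_reconstruction_error(f_img, render_fn(shape_feats, pose))
            if err < best[2]:
                best = (shape_id, pose, err)
    return best  # (shape_id, pose, error)
```

With a toy `render_fn` (here, a row shift standing in for a real pose-conditioned projection), the loop recovers the shape and pose that generated the input features.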
Load-bearing premise
That the alignments produced by distilling 2D features through pose-conditioned 3D projections stay reliable when large parts of the object are hidden, and that the test-time optimization reaches the correct shape and pose rather than settling into a poor local solution.
What would settle it
Running the test-time optimization on a held-out collection of real-world, heavily occluded images and finding that the recovered shapes and poses frequently disagree with human-annotated ground truth or known 3D models would show that the robustness claim does not hold.
Original abstract
Single-view 3D shape retrieval is a fundamental yet challenging task that is increasingly important with the growth of available 3D data. Existing approaches largely fall into two categories: those using contrastive learning to map point cloud features into existing vision-language spaces and those that learn a common embedding space for 2D images and 3D shapes. However, these feed-forward, holistic alignments are often difficult to interpret, which in turn limits their robustness and generalization to real-world applications. To address this problem, we propose Pose-Aware 3D Shape Retrieval (PASR), a framework that formulates retrieval as a feature-level analysis-by-synthesis problem by distilling knowledge from a 2D foundation model (DINOv3) into a 3D encoder. By aligning pose-conditioned 3D projections with 2D feature maps, our method bridges the gap between real-world images and synthetic meshes. During inference, PASR performs a test-time optimization via analysis-by-synthesis, jointly searching for the shape and pose that best reconstruct the patch-level feature map of the input image. This synthesis-based optimization is inherently robust to partial occlusion and sensitive to fine-grained geometric details. PASR substantially outperforms existing methods on both clean and occluded 3D shape retrieval datasets by a wide margin. Additionally, PASR demonstrates strong multi-task capabilities, achieving robust shape retrieval, competitive pose estimation, and accurate category classification within a single framework.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PASR, a framework that distills DINOv3 2D features into a 3D encoder via pose-conditioned projections to enable single-view 3D shape retrieval. Retrieval is cast as test-time analysis-by-synthesis: joint optimization over a shape database and pose parameters minimizes the reconstruction error between the input image's DINOv3 feature map and the rendered 3D projection. The abstract claims wide-margin gains over prior methods on both clean and occluded retrieval benchmarks plus competitive multi-task performance in pose estimation and category classification.
Significance. If the empirical claims and optimization reliability hold, PASR would demonstrate that distilling foundation-model patch features into a pose-aware 3D encoder followed by test-time search can yield more robust and interpretable retrieval under occlusion than pure feed-forward embeddings. The multi-task framing and explicit use of analysis-by-synthesis are potentially valuable contributions to real-world 3D vision pipelines.
major comments (3)
- [Abstract] The central claim that PASR 'substantially outperforms existing methods on both clean and occluded 3D shape retrieval datasets by a wide margin' is stated without quantitative metrics, baseline names, tables, or error bars, rendering the headline result unverifiable from the provided text.
- [§3.3] Test-time optimization: no quantitative analysis is supplied on convergence behavior, sensitivity to initialization, iteration count, success rate, or failure modes across occlusion ratios; these are load-bearing for the robustness claim.
- [§3.2] Pose-conditioned projections: the projection operator is not described as modeling visibility or occlusion masks; without such modeling, the feature-map alignment may degrade systematically in exactly the partial-view regime the method claims to handle best.
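A visibility-masked variant of the feature loss, of the kind the third comment asks about, is easy to state. This is a hedged sketch, not the paper's implementation; the `visibility` vector stands in for whatever per-patch mask a renderer would produce (e.g. from rendered depth or an object silhouette).

```python
import numpy as np

def masked_feature_loss(f_img, f_rendered, visibility):
    """Feature reconstruction error restricted to patches the renderer marks
    as visible; occluded patches contribute nothing to the alignment."""
    visibility = np.asarray(visibility).astype(bool)
    if not visibility.any():
        return np.inf  # nothing visible: this (shape, pose) cannot be scored
    diff = f_img[visibility] - f_rendered[visibility]
    return float(np.mean(diff ** 2))
```

Masking out the disagreeing patches drives the loss to zero even when the unmasked loss is large, which is exactly the behavior the referee argues needs to be modeled or measured.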
minor comments (1)
- [§3] Notation for the 3D encoder output and the feature-map loss could be clarified with an explicit equation relating the projected features to the DINOv3 patch tokens.
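One plausible form of the requested equation, with hypothetical notation (none of these symbols are confirmed by the provided text): let $F_p(I)$ be the $p$-th DINOv3 patch token of image $I$, $g_\theta$ the 3D encoder, $\Pi_\pi$ the pose-conditioned projection at pose $\pi$, and $s$ a candidate shape.

```latex
% Illustrative feature-map loss; notation is a sketch, not the paper's.
\mathcal{L}(s, \pi) \;=\; \sum_{p} \bigl\lVert F_p(I) - \Pi_\pi\!\bigl(g_\theta(s)\bigr)_p \bigr\rVert_2^2,
\qquad
(\hat{s}, \hat{\pi}) \;=\; \arg\min_{s,\,\pi} \; \mathcal{L}(s, \pi).
```

An explicit equation of this shape would pin down both the training alignment target and the test-time objective in one place.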
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript. We have carefully considered each major comment and provide point-by-point responses below, indicating where revisions have been made to strengthen the paper.
Point-by-point responses
- Referee: [Abstract] The central claim that PASR 'substantially outperforms existing methods on both clean and occluded 3D shape retrieval datasets by a wide margin' is stated without quantitative metrics, baseline names, tables, or error bars, rendering the headline result unverifiable from the provided text.
  Authors: We agree that the abstract would be strengthened by specific quantitative results supporting the performance claims. In the revised manuscript, we have updated the abstract to report key metrics, including retrieval accuracy improvements (with error bars from multiple runs) over named baselines on both clean and occluded datasets, along with direct references to the relevant tables and figures. revision: yes
- Referee: [§3.3] Test-time optimization: no quantitative analysis is supplied on convergence behavior, sensitivity to initialization, iteration count, success rate, or failure modes across occlusion ratios; these are load-bearing for the robustness claim.
  Authors: We acknowledge the value of additional empirical analysis of the test-time optimization. We have revised §3.3 and added an appendix with quantitative results, including average iterations to convergence, success rates under varying occlusion levels, sensitivity to pose initialization (random vs. coarse), and examples of failure modes. These additions directly support the robustness claims. revision: yes
- Referee: [§3.2] Pose-conditioned projections: the projection operator is not described as modeling visibility or occlusion masks; without such modeling, the feature-map alignment may degrade systematically in exactly the partial-view regime the method claims to handle best.
  Authors: The projection uses differentiable rendering of pose-conditioned 3D features aligned to DINOv3 patches. While explicit visibility or occlusion masks are not computed, the joint optimization over shape and pose implicitly handles partial views by minimizing feature error only on matching regions. In the revision, we have expanded §3.2 to clarify this mechanism, added a discussion of implicit occlusion robustness, and included an ablation comparing performance with and without explicit mask modeling. We believe this addresses the concern without altering the core method. revision: partial
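The rebuttal's implicit-robustness argument holds only if the loss stops occluded patches from dominating the objective. One standard mechanism that achieves this, offered here as an assumption rather than something the paper states it uses, is a truncated per-patch error:

```python
import numpy as np

def truncated_feature_loss(f_img, f_rendered, tau):
    """Per-patch squared error clipped at tau: patches whose error exceeds the
    threshold (e.g. because an occluder covers them) saturate instead of
    dominating the objective. This is one way a feature-matching loss can be
    implicitly robust to partial occlusion without an explicit mask."""
    per_patch = np.sum((f_img - f_rendered) ** 2, axis=1)
    return float(np.mean(np.minimum(per_patch, tau)))
```

When half the patches are heavily corrupted, the plain mean squared error explodes while the truncated loss stays bounded by the threshold, which is the qualitative behavior the rebuttal appeals to.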
Circularity Check
No significant circularity; the method relies on an external DINOv3 model and keeps training and inference as distinct steps.
full rationale
The paper's core chain (distilling DINOv3 patch features into a 3D encoder via pose-conditioned projections, then performing test-time optimization to minimize feature-map reconstruction error for joint shape/pose retrieval) does not reduce to its own inputs by construction. The 2D foundation model is external and pretrained independently, the training objective aligns projections to those features, and inference searches a shape database under the same metric; these are sequential, non-circular steps. Performance claims rest on empirical results against baselines on clean and occluded datasets rather than on self-definitional renaming or fitted parameters presented as predictions. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the abstract or description, and the framework's claims are grounded in external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: DINOv3 features capture geometric structure usable for 3D mesh alignment