PASR: Pose-Aware 3D Shape Retrieval from Occluded Single Views
Pith reviewed 2026-05-08 12:33 UTC · model grok-4.3
The pith
A pose-aware analysis-by-synthesis approach retrieves 3D shapes from single occluded images by distilling 2D features into a 3D encoder and optimizing over shape and pose at test time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PASR formulates retrieval as a feature-level analysis-by-synthesis problem by distilling DINOv3 features into a 3D encoder, aligning pose-conditioned projections of 3D shapes with 2D feature maps during training. At inference, a test-time optimization searches over shape identity and pose parameters for the combination that best reconstructs the patch-level feature map of the input image. The paper reports substantial gains on both clean and occluded retrieval benchmarks, together with competitive pose estimation and category classification.
What carries the argument
Pose-conditioned projection of distilled 3D features into 2D space, followed by test-time analysis-by-synthesis optimization that jointly varies shape identity and pose to match the input feature map.
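The search described above can be sketched as a loop over candidate shapes and poses. This is a minimal, gradient-free sketch under stated assumptions: `shape_bank`, `render_fn`, and `pose_grid` are hypothetical names, not the paper's API, and the actual method presumably optimizes pose continuously with gradients rather than over a discrete grid.

```python
import numpy as np

def feature_reconstruction_error(f_img, f_rendered):
    """Mean squared error between two (num_patches, dim) feature maps."""
    return float(np.mean((f_img - f_rendered) ** 2))

def retrieve_shape_and_pose(f_img, shape_bank, render_fn, pose_grid):
    """Analysis-by-synthesis retrieval sketch: for every candidate shape and
    every pose in a coarse grid, render pose-conditioned features and keep the
    (shape, pose) pair whose rendering best reconstructs the image features."""
    best = (None, None, np.inf)
    for shape_id, shape_feats in shape_bank.items():
        for pose in pose_grid:
            err = feature_reconstruction_error(f_img, render_fn(shape_feats, pose))
            if err < best[2]:
                best = (shape_id, pose, err)
    return best  # (shape_id, pose, error)
```

With a toy `render_fn` (here, a row shift standing in for a real pose-conditioned projection), the loop recovers the shape and pose that generated the input features.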
Load-bearing premise
That the alignments produced by distilling 2D features through pose-conditioned 3D projections stay reliable when large parts of the object are hidden, and that the test-time optimization reaches the correct shape and pose rather than settling into a poor local solution.
What would settle it
Running the test-time optimization on a held-out collection of real-world, heavily occluded images and finding that the recovered shapes and poses frequently disagree with human-annotated ground truth or known 3D models would show that the robustness claim does not hold.
Original abstract
Single-view 3D shape retrieval is a fundamental yet challenging task that is increasingly important with the growth of available 3D data. Existing approaches largely fall into two categories: those using contrastive learning to map point cloud features into existing vision-language spaces and those that learn a common embedding space for 2D images and 3D shapes. However, these feed-forward, holistic alignments are often difficult to interpret, which in turn limits their robustness and generalization to real-world applications. To address this problem, we propose Pose-Aware 3D Shape Retrieval (PASR), a framework that formulates retrieval as a feature-level analysis-by-synthesis problem by distilling knowledge from a 2D foundation model (DINOv3) into a 3D encoder. By aligning pose-conditioned 3D projections with 2D feature maps, our method bridges the gap between real-world images and synthetic meshes. During inference, PASR performs a test-time optimization via analysis-by-synthesis, jointly searching for the shape and pose that best reconstruct the patch-level feature map of the input image. This synthesis-based optimization is inherently robust to partial occlusion and sensitive to fine-grained geometric details. PASR substantially outperforms existing methods on both clean and occluded 3D shape retrieval datasets by a wide margin. Additionally, PASR demonstrates strong multi-task capabilities, achieving robust shape retrieval, competitive pose estimation, and accurate category classification within a single framework.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PASR, a framework that distills DINOv3 2D features into a 3D encoder via pose-conditioned projections to enable single-view 3D shape retrieval. Retrieval is cast as test-time analysis-by-synthesis: joint optimization over a shape database and pose parameters minimizes the reconstruction error between the input image's DINOv3 feature map and the rendered 3D projection. The abstract claims wide-margin gains over prior methods on both clean and occluded retrieval benchmarks plus competitive multi-task performance in pose estimation and category classification.
Significance. If the empirical claims and optimization reliability hold, PASR would demonstrate that distilling foundation-model patch features into a pose-aware 3D encoder followed by test-time search can yield more robust and interpretable retrieval under occlusion than pure feed-forward embeddings. The multi-task framing and explicit use of analysis-by-synthesis are potentially valuable contributions to real-world 3D vision pipelines.
major comments (3)
- [Abstract] The central claim that PASR 'substantially outperforms existing methods on both clean and occluded 3D shape retrieval datasets by a wide margin' is stated without quantitative metrics, baseline names, tables, or error bars, rendering the headline result unverifiable from the provided text.
- [§3.3] Test-time optimization: no quantitative analysis is supplied on convergence behavior, sensitivity to initialization, iteration count, success rate, or failure modes across occlusion ratios; these are load-bearing for the robustness claim.
- [§3.2] Pose-conditioned projections: the projection operator is not described as modeling visibility or occlusion masks; without such modeling, the feature-map alignment may degrade systematically in exactly the partial-view regime the method claims to handle best.
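A visibility-masked variant of the feature loss, of the kind the third comment asks about, is easy to state. This is a hedged sketch, not the paper's implementation; the `visibility` vector stands in for whatever per-patch mask a renderer would produce (e.g. from rendered depth or an object silhouette).

```python
import numpy as np

def masked_feature_loss(f_img, f_rendered, visibility):
    """Feature reconstruction error restricted to patches the renderer marks
    as visible; occluded patches contribute nothing to the alignment."""
    visibility = np.asarray(visibility).astype(bool)
    if not visibility.any():
        return np.inf  # nothing visible: this (shape, pose) cannot be scored
    diff = f_img[visibility] - f_rendered[visibility]
    return float(np.mean(diff ** 2))
```

Masking out the disagreeing patches drives the loss to zero even when the unmasked loss is large, which is exactly the behavior the referee argues needs to be modeled or measured.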
minor comments (1)
- [§3] Notation for the 3D encoder output and the feature-map loss could be clarified with an explicit equation relating the projected features to the DINOv3 patch tokens.
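One plausible form of the requested equation, with hypothetical notation (none of these symbols are confirmed by the provided text): let $F_p(I)$ be the $p$-th DINOv3 patch token of image $I$, $g_\theta$ the 3D encoder, $\Pi_\pi$ the pose-conditioned projection at pose $\pi$, and $s$ a candidate shape.

```latex
% Illustrative feature-map loss; notation is a sketch, not the paper's.
\mathcal{L}(s, \pi) \;=\; \sum_{p} \bigl\lVert F_p(I) - \Pi_\pi\!\bigl(g_\theta(s)\bigr)_p \bigr\rVert_2^2,
\qquad
(\hat{s}, \hat{\pi}) \;=\; \arg\min_{s,\,\pi} \; \mathcal{L}(s, \pi).
```

An explicit equation of this shape would pin down both the training alignment target and the test-time objective in one place.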
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript. We have carefully considered each major comment and provide point-by-point responses below, indicating where revisions have been made to strengthen the paper.
Point-by-point responses
- Referee: [Abstract] The central claim that PASR 'substantially outperforms existing methods on both clean and occluded 3D shape retrieval datasets by a wide margin' is stated without quantitative metrics, baseline names, tables, or error bars, rendering the headline result unverifiable from the provided text.
  Authors: We agree that the abstract would be strengthened by specific quantitative results supporting the performance claims. In the revised manuscript, we have updated the abstract to report key metrics, including retrieval accuracy improvements (with error bars from multiple runs) over named baselines on both clean and occluded datasets, along with direct references to the relevant tables and figures. revision: yes
- Referee: [§3.3] Test-time optimization: no quantitative analysis is supplied on convergence behavior, sensitivity to initialization, iteration count, success rate, or failure modes across occlusion ratios; these are load-bearing for the robustness claim.
  Authors: We acknowledge the value of additional empirical analysis of the test-time optimization. We have revised §3.3 and added an appendix with quantitative results, including average iterations to convergence, success rates under varying occlusion levels, sensitivity to pose initialization (random vs. coarse), and examples of failure modes. These additions directly support the robustness claims. revision: yes
- Referee: [§3.2] Pose-conditioned projections: the projection operator is not described as modeling visibility or occlusion masks; without such modeling, the feature-map alignment may degrade systematically in exactly the partial-view regime the method claims to handle best.
  Authors: The projection uses differentiable rendering of pose-conditioned 3D features aligned to DINOv3 patches. While explicit visibility or occlusion masks are not computed, the joint optimization over shape and pose implicitly handles partial views by minimizing feature error only on matching regions. In the revision, we have expanded §3.2 to clarify this mechanism, added a discussion of implicit occlusion robustness, and included an ablation comparing performance with and without explicit mask modeling. We believe this addresses the concern without altering the core method. revision: partial
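The rebuttal's implicit-robustness argument holds only if the loss stops occluded patches from dominating the objective. One standard mechanism that achieves this, offered here as an assumption rather than something the paper states it uses, is a truncated per-patch error:

```python
import numpy as np

def truncated_feature_loss(f_img, f_rendered, tau):
    """Per-patch squared error clipped at tau: patches whose error exceeds the
    threshold (e.g. because an occluder covers them) saturate instead of
    dominating the objective. This is one way a feature-matching loss can be
    implicitly robust to partial occlusion without an explicit mask."""
    per_patch = np.sum((f_img - f_rendered) ** 2, axis=1)
    return float(np.mean(np.minimum(per_patch, tau)))
```

When half the patches are heavily corrupted, the plain mean squared error explodes while the truncated loss stays bounded by the threshold, which is the qualitative behavior the rebuttal appeals to.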
Circularity Check
No significant circularity; the method relies on an external DINOv3 model and keeps training and inference as distinct steps.
full rationale
The paper's core chain (distilling DINOv3 patch features into a 3D encoder via pose-conditioned projections, then performing test-time optimization to minimize feature-map reconstruction error for joint shape/pose retrieval) does not reduce to its own inputs by construction. The 2D foundation model is external and pretrained independently, the training objective aligns projections to those features, and inference searches a shape database under the same metric; these are sequential, non-circular steps. Performance claims rest on empirical results against baselines on clean and occluded datasets rather than on self-definitional renaming or fitted parameters presented as predictions. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the abstract or description, and the framework's claims are grounded in external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: DINOv3 features capture geometric structure usable for 3D mesh alignment