pith. machine review for the scientific record.

arxiv: 2604.22826 · v1 · submitted 2026-04-19 · 💻 cs.CV · cs.LG

Recognition: unknown

Shape: A Self-Supervised 3D Geometry Foundation Model for Industrial CAD Analysis

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 06:11 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords self-supervised learning · 3D geometry · CAD meshes · foundation model · mesh reconstruction · contrastive consistency · transformer

The pith

Shape is a self-supervised model that produces dense 3D embeddings from CAD meshes using masked token reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops Shape, a self-supervised foundation model that turns CAD surface meshes into dense per-token embeddings for industrial use. It combines a structured latent grid, a multi-scale tokenizer with cross-attention, and a transformer processor. Pretraining uses masked reconstruction of normalized per-dimension geometry statistics and multi-resolution contrastive consistency losses on 61,052 meshes from public collections. This setup is intended to produce generalizable embeddings that support accurate reconstruction, high-accuracy retrieval, and explainable attributions without task-specific labels.
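The ablation's headline finding is that per-dimension normalization of the reconstruction targets is critical. The paper's exact procedure is not reproduced in this summary; below is a minimal sketch of what z-scoring each geometry-statistic dimension against training-set statistics might look like (the `per_dim_normalize` helper and the toy values are illustrative, not taken from the paper):

```python
import numpy as np

def per_dim_normalize(stats, eps=1e-8):
    """Normalize each geometry-statistic dimension to zero mean and
    unit variance, using per-dimension training-set statistics."""
    mu = stats.mean(axis=0)
    sigma = stats.std(axis=0)
    return (stats - mu) / (sigma + eps), mu, sigma

# Toy per-token geometry statistics: 5 tokens x 3 dims with very
# different raw scales (e.g. area, curvature, extent).
stats = np.array([[1000.0, 0.010, 5.0],
                  [1010.0, 0.020, 6.0],
                  [ 990.0, 0.030, 4.0],
                  [1005.0, 0.015, 5.5],
                  [ 995.0, 0.025, 4.5]])
normed, mu, sigma = per_dim_normalize(stats)
```

Without this step, dimensions with large raw scales (like the first column above) dominate a squared-error reconstruction loss, which is one plausible reading of the reported collapse (R² < 0.14) in the unnormalized ablation arm.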

Core claim

Shape establishes that self-supervised pretraining with masked-token reconstruction of normalized geometry statistics and multi-resolution contrastive consistency allows a transformer-based model to learn embeddings from CAD meshes that generalize well, as evidenced by strong performance on reconstruction and retrieval tasks on held-out data with little overfitting.
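The 98.1% top-1 retrieval figure follows the Wang-Isola protocol, whose details are not reproduced in this summary. A generic sketch of top-1 nearest-neighbor retrieval over cosine similarity (the similarity choice and the labeled gallery setup are assumptions for illustration):

```python
import numpy as np

def top1_retrieval_accuracy(queries, gallery, labels_q, labels_g):
    """Fraction of queries whose single nearest gallery embedding
    (by cosine similarity) carries the same label as the query."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    nearest = np.argmax(q @ g.T, axis=1)  # index of best match per query
    hits = [labels_g[i] == lq for i, lq in zip(nearest, labels_q)]
    return sum(hits) / len(hits)

# Toy check: two well-separated embedding directions.
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
gallery = np.array([[0.9, 0.1], [0.1, 0.9]])
acc = top1_retrieval_accuracy(queries, gallery, ["a", "b"], ["a", "b"])
```

A high top-1 score under this kind of protocol is evidence that the embedding space is discriminative, which is the property the core claim leans on.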

What carries the argument

Masked token reconstruction of normalized geometry statistics combined with multi-resolution contrastive consistency on a transformer backbone that includes a multi-scale geometry-aware tokenizer.

Load-bearing premise

That the combination of masked reconstruction and contrastive consistency on the chosen CAD datasets produces embeddings generalizing robustly to unseen industrial CAD analysis tasks.

What would settle it

Evaluating the model on a new collection of industrial CAD meshes from a different source and observing significantly lower reconstruction accuracy or retrieval performance than reported would falsify the generalization claim.

Figures

Figures reproduced from arXiv: 2604.22826 by Bayangmbe Mounmo, Mile Mitrovic, Sam Chien.

Figure 1. Evaluation metrics on the held-out validation split.

Figure 2. Alignment-uniformity diagnostic [8]. Alignment is the expected squared Euclidean distance between embeddings of positive pairs; uniformity is the logarithm of the expected exponentially-decayed pairwise distance between random pairs. Lower values on both axes indicate a better contrastive representation.
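The two axes in Figure 2 follow Wang and Isola [8]. Under their standard definitions (with temperature t = 2, an assumption since the caption does not state it), the metrics can be computed as:

```python
import numpy as np

def alignment(x, y):
    """Expected squared Euclidean distance between embeddings of
    positive pairs (x[i], y[i]); lower is better."""
    return float(np.mean(np.sum((x - y) ** 2, axis=1)))

def uniformity(x, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs;
    lower means embeddings spread more uniformly on the sphere."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    i, j = np.triu_indices(len(x), k=1)
    return float(np.log(np.mean(np.exp(-t * d2[i, j]))))

# Two antipodal unit embeddings: perfectly aligned with themselves,
# maximally spread on the 1-sphere.
z = np.array([[1.0, 0.0], [-1.0, 0.0]])
```

For `z` above, `alignment(z, z)` is exactly 0 and `uniformity(z)` is log(exp(-2·4)) = -8, illustrating why lower is better on both axes.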
read the original abstract

Industrial CAD workflows require robust, generalizable 3D geometric representations supporting accuracy and explainability. We introduce Shape, a self-supervised foundation model converting surface meshes into dense per-token embeddings. Shape combines a structured 3D latent grid, a multi-scale geometry-aware tokenizer (MAGNO) with cross-attention, and a transformer processor using grouped-query attention and RMSNorm. A learned reconstruction prior enables per-region attribution for explainable predictions. Pretraining uses masked-token reconstruction of normalized geometry statistics and multi-resolution contrastive consistency. The 10.9M-parameter backbone is pretrained on 61,052 CAD meshes from Thingi10K, MFCAD, and Fusion360. On a held-out split of 2,983 meshes, Shape achieves reconstruction R2 = 0.729 and 98.1% top-1 retrieval under the Wang-Isola protocol, with near-zero reconstruction train/val gap (contrastive scores use a larger evaluation pool). A 2x2 ablation on loss type and target-space normalization shows per-dimension normalization is critical: without it, performance collapses (R2 < 0.14, top-1 < 88%); with it, both losses succeed (R2 > 0.70, top-1 > 96%). Smooth-L1 offers secondary stability. Code, embeddings, and an interactive demo are released at https://github.com/simd-ai/shape.
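The abstract's remark that "Smooth-L1 offers secondary stability" refers to the Huber-style loss of [28] as popularized in Fast R-CNN [29]. A minimal sketch (beta = 1.0 is the conventional default, not a value taken from the paper):

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 (Huber-style) loss: quadratic near zero, linear for
    large residuals, so outlier targets pull less hard than under L2."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff ** 2 / beta
    return diff - 0.5 * beta

# smooth_l1(0.5, 0.0) -> 0.125  (quadratic regime)
# smooth_l1(3.0, 0.0) -> 2.5    (linear regime)
```

The two branches meet smoothly at `diff == beta`, which is the stability property the ablation credits it with relative to plain L2 on noisy geometry targets.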

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Shape, a self-supervised 3D geometry foundation model that converts surface meshes into dense per-token embeddings using a structured 3D latent grid, the MAGNO tokenizer with cross-attention, and a transformer processor. It is pretrained on 61,052 CAD meshes from Thingi10K, MFCAD, and Fusion360 via masked-token reconstruction of normalized geometry statistics and multi-resolution contrastive consistency. On a held-out split of 2,983 meshes, it achieves reconstruction R² = 0.729 and 98.1% top-1 retrieval, with an ablation showing per-dimension normalization is critical for performance.

Significance. If the embeddings generalize beyond the proxy tasks, Shape could serve as a useful foundation model for industrial CAD analysis, supported by the released code, embeddings, and the 2x2 ablation isolating the role of normalization. The near-zero train/val gap and concrete metrics strengthen the internal validity of the pretraining approach.

major comments (1)
  1. [Abstract] The positioning of Shape as supporting 'industrial CAD workflows' and 'industrial CAD analysis' relies on the assumption that the learned embeddings will transfer to real downstream tasks. However, all reported results are limited to reconstruction R² and retrieval accuracy on held-out data from the pretraining datasets, with no evaluations on actual industrial tasks such as manufacturing feature classification, part segmentation, or assembly reasoning. This is load-bearing for the central claim of the manuscript.
minor comments (1)
  1. [Abstract] The reconstruction metric is written as 'R2'; it should be formatted as R² for mathematical clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for recognizing the model's internal validity, the near-zero train/val gap, the ablation isolating normalization, and the value of the released code and embeddings. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] The positioning of Shape as supporting 'industrial CAD workflows' and 'industrial CAD analysis' relies on the assumption that the learned embeddings will transfer to real downstream tasks. However, all reported results are limited to reconstruction R² and retrieval accuracy on held-out data from the pretraining datasets, with no evaluations on actual industrial tasks such as manufacturing feature classification, part segmentation, or assembly reasoning. This is load-bearing for the central claim of the manuscript.

    Authors: We agree that the abstract's framing assumes the embeddings will prove useful for downstream industrial tasks and that direct evidence on tasks such as manufacturing feature classification, part segmentation, or assembly reasoning would strengthen the central claim. The current results focus on self-supervised pretraining quality via masked normalized geometry reconstruction (R² = 0.729) and multi-resolution contrastive retrieval (98.1% top-1) on held-out meshes drawn from the same industrial CAD sources (Thingi10K, MFCAD, Fusion360). These proxies directly measure geometric fidelity and discriminative power, which are prerequisites for the cited downstream applications; the 2×2 ablation further isolates that per-dimension normalization is essential for both objectives. In the foundation-model literature, such proxy metrics on held-out data are standard for validating the backbone prior to transfer studies. The released embeddings and code are explicitly provided to support exactly those follow-on evaluations. To address the concern without overstating current evidence, we will revise the abstract to state that the model supplies dense per-token embeddings validated on reconstruction and retrieval proxies and is intended as a foundation for industrial CAD analysis, while adding a dedicated limitations paragraph that explicitly notes the absence of downstream task results and the reliance on proxy generalization.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical pretraining and held-out evaluation

full rationale

The paper describes a self-supervised transformer model pretrained via masked reconstruction and contrastive losses on public CAD datasets, with performance reported as R² = 0.729 and 98.1% top-1 retrieval on a held-out split of 2,983 meshes plus an empirical 2×2 ablation on normalization. No equations, derivations, or self-citations are presented that reduce any claimed result to fitted parameters or prior author work by construction. All metrics are computed on independent test data under standard protocols, satisfying the criteria for a self-contained empirical result with no load-bearing circular steps.
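The rationale leans on R² = 0.729 as the headline reconstruction number. R² here is presumably the standard coefficient of determination; for reference:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 minus the residual sum of
    squares over the total sum of squares around the mean of y_true."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Under this reading, R² = 0.729 means roughly 73% of the variance in the held-out geometry targets is explained by the reconstruction; a model that always predicts the mean scores 0, and a perfect reconstruction scores 1.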

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The performance claims rest on 10.9M learned parameters optimized via self-supervised objectives on the specified CAD datasets, plus design choices (normalization, loss type) validated only by the reported ablation.

free parameters (2)
  • 10.9M backbone parameters
    Weights of the tokenizer, latent grid, and transformer are fitted during pretraining on 61k meshes.
  • Per-dimension normalization scales
    Ablation shows these are critical; they are either learned or computed from training statistics.
axioms (2)
  • domain assumption Masked token reconstruction of normalized geometry statistics yields useful dense embeddings
    Core self-supervised objective stated in the abstract.
  • domain assumption Multi-resolution contrastive consistency improves geometric representation quality
    Second pretraining objective listed in the abstract.
invented entities (1)
  • MAGNO tokenizer · no independent evidence
    purpose: Multi-scale geometry-aware tokenization via cross-attention on the latent grid
    New component introduced to process 3D meshes at multiple scales.

pith-pipeline@v0.9.0 · 5561 in / 1710 out tokens · 67221 ms · 2026-05-10T06:11:55.196605+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 6 canonical work pages · 4 internal anchors

  1. [1] Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-BERT: Pre-training 3D point cloud transformers with masked point modeling. In CVPR, 2022.
  2. [2] Yatian Pang, Wenxiao Wang, Francis E. H. Tay, Wei Liu, Yonghong Tian, and Li Yuan. Masked autoencoders for point cloud self-supervised learning. In ECCV, 2022.
  3. [3] Shizheng Wen, Arsh Kumbhat, Levi Lingsch, Sepehr Mousavi, Yizhou Zhao, Praveen Chandrashekar, and Siddhartha Mishra. Geometry-aware operator transformer as an efficient and accurate neural surrogate for PDEs on arbitrary domains. In NeurIPS, 2025. https://arxiv.org/abs/2505.18781
  4. [4] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  5. [5] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In EMNLP, 2023.
  6. [6] Biao Zhang and Rico Sennrich. Root mean square layer normalization. In NeurIPS, 2019.
  7. [7] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. arXiv:2104.09864, 2021.
  8. [8] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, 2020.
  9. [9] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7), 2015.
  10. [10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
  11. [11] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. Technical report, OpenAI, 2018.
  12. [12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16×16 words: Transformers for image recognition at scale. In ICLR, 2021.
  13. [13] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
  14. [14] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024.
  15. [15] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
  16. [16] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv:2003.04297, 2020.
  17. [17] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv:1807.03748, 2018.
  18. [18] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
  19. [19] Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, 2017.
  20. [20] Renrui Zhang, Ziyu Guo, Peng Gao, Rongyao Fang, Bin Zhao, Dong Wang, Yu Qiao, and Hongsheng Li. Point-M2AE: Multi-scale masked autoencoders for hierarchical point cloud pre-training. In NeurIPS, 2022.
  21. [21] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differential equations. arXiv:2003.03485, 2020.
  22. [22] Sebastian Koch, Albert Matveev, Zhongshi Jiang, Francis Williams, Alexey Artemov, Evgeny Burnaev, Marc Alexa, Denis Zorin, and Daniele Panozzo. ABC: A big CAD model dataset for geometric deep learning. In CVPR, 2019.
  23. [23] Qingnan Zhou and Alec Jacobson. Thingi10K: A dataset of 10,000 3D-printing models. arXiv:1605.04797, 2016.
  24. [24] Karl D. D. Willis, Yewen Pu, Jieliang Luo, Hang Chu, Tao Du, Joseph G. Lambourne, Armando Solar-Lezama, and Wojciech Matusik. Fusion 360 Gallery: A dataset and environment for programmatic CAD construction from human design sequences. ACM Transactions on Graphics (SIGGRAPH), 2021.
  25. [25] Andrew Colligan, Trevor Robinson, Declan Nolan, Yang Hua, and Wanbin Cao. Hierarchical CADnet: Learning from B-reps for machining feature recognition. Computer-Aided Design, 2022.
  26. [26] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3D objects. In CVPR, 2023.
  27. [27] Kaichun Mo, Shilin Zhu, Angel X. Chang, Li Yi, Subarna Tripathi, Leonidas J. Guibas, and Hao Su. PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In CVPR, 2019.
  28. [28] Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
  29. [29] Ross Girshick. Fast R-CNN. In ICCV, 2015.
  30. [30] Reduan Achtibat, Maximilian Dreyer, Ilona Eisenbraun, Sebastian Bosse, Thomas Wiegand, Wojciech Samek, and Sebastian Lapuschkin. From attribution maps to human-understandable explanations through concept relevance propagation. Nature Machine Intelligence, 2023.
  31. [31] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam R. Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In ICML, 2019.
  32. [32] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.