pith. machine review for the scientific record.

arxiv: 2604.04050 · v1 · submitted 2026-04-05 · 💻 cs.CV · cs.LG

Recognition: 2 theorem links


TORA: Topological Representation Alignment for 3D Shape Assembly

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 17:06 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords 3D shape assembly · flow matching · representation alignment · topological alignment · zero-shot transfer · pretrained 3D encoders · part assembly · domain shift robustness

The pith

TORA aligns flow models to frozen 3D encoders to speed assembly training and improve accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TORA, a framework that aligns the internal representations of a flow-matching model for 3D shape assembly with those of a frozen pretrained encoder, transferring geometric and contact-based relational information that guides part movements during assembly. The method adds no computation at inference time, yet it yields up to 6.9× faster training convergence, higher accuracy on benchmarks, and stronger results on previously unseen real and synthetic data.

Core claim

TORA introduces a topology-first representation alignment framework that distills relational structure from a frozen pretrained 3D encoder into the flow-matching backbone during training. It realizes this first through token-wise cosine matching of learned geometric descriptors, then through a Centered Kernel Alignment (CKA) loss that matches similarity structures, with alignment most effective at later transformer layers, where spatial relations emerge. Geometry- and contact-centric teacher properties, rather than semantic classification ability, drive the gains.
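To make the first objective concrete, here is a minimal sketch of token-wise cosine matching, assuming the student and teacher token grids have already been brought to a common token count and feature width (any projection heads and the array shapes below are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def cosine_alignment_loss(student_tokens, teacher_tokens, eps=1e-8):
    """Mean (1 - cosine similarity) between matched student/teacher tokens.

    Both inputs have shape (num_tokens, dim) and are assumed to be
    projected to the same dimension before alignment.
    """
    s = student_tokens / (np.linalg.norm(student_tokens, axis=-1, keepdims=True) + eps)
    t = teacher_tokens / (np.linalg.norm(teacher_tokens, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

rng = np.random.default_rng(0)
teacher = rng.standard_normal((128, 64))           # frozen teacher token features
student = teacher + 0.05 * rng.standard_normal((128, 64))  # nearly-aligned student
loss = cosine_alignment_loss(student, teacher)     # small, since directions nearly match
```

The loss is zero when every student token points in the same direction as its teacher counterpart and grows toward 2 as tokens become anti-aligned, which is what makes it usable as a simple distillation target.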

What carries the argument

Topological representation alignment: matching token-wise cosine similarity, or centered kernel alignment (CKA), between the student flow model's representations and those of the frozen teacher encoder.
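The CKA variant compares similarity structure rather than individual tokens. A minimal sketch using the standard linear-kernel CKA (the paper's exact kernel and normalization may differ):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two sets of token features.

    X: (n_tokens, d_student), Y: (n_tokens, d_teacher). Returns a value in
    [0, 1]; 1 means identical similarity structure. An alignment loss can
    then be taken as 1 - CKA.
    """
    Xc = X - X.mean(axis=0, keepdims=True)  # center each feature dimension
    Yc = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Yc.T @ Xc, ord="fro") ** 2
    den = (np.linalg.norm(Xc.T @ Xc, ord="fro")
           * np.linalg.norm(Yc.T @ Yc, ord="fro"))
    return float(num / den)

rng = np.random.default_rng(1)
student = rng.standard_normal((64, 32))
# CKA is invariant to rotation and isotropic rescaling of either view,
# so a rotated, rescaled copy still scores ~1.
Q, _ = np.linalg.qr(rng.standard_normal((32, 32)))
assert abs(linear_cka(student, 3.0 * student @ Q) - 1.0) < 1e-6
```

That invariance is the point of using CKA here: the student does not need to reproduce the teacher's coordinates, only the relational structure among tokens.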

If this is right

  • Training converges up to 6.9 times faster than unaligned flow-matching baselines.
  • Assembly accuracy rises on in-distribution benchmarks spanning geometric, semantic, and inter-object tasks.
  • Performance holds up better under domain shifts to unseen real-world and synthetic datasets.
  • Zero-shot transfer gains are especially large compared with prior methods.
  • State-of-the-art results are reached with zero added cost at inference time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teacher encoders selected for geometric rather than semantic properties are likely to produce stronger alignment benefits in assembly tasks.
  • The same alignment strategy could be applied to other generative backbones such as diffusion models for 3D tasks.
  • Systematic sweeps of alignment depth across different network architectures would identify general rules for when relational distillation helps most.
  • Extending the approach to time-varying or articulated assemblies would test whether the learned topological relations remain useful beyond static part placement.

Load-bearing premise

That geometry- and contact-centric properties extracted from the frozen teacher encoder supply the right relational guidance for the flow model and that alignment at later layers improves outcomes without negative transfer.
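This premise can be exercised with a toy training-step sketch in which the alignment term is attached at a chosen late block of the student and weighted against the flow-matching loss. Everything here is illustrative, not the paper's architecture: the tanh-MLP stack stands in for the flow backbone, and `align_block` and `lambda_align` are hypothetical names and values.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_TOKENS, N_BLOCKS = 32, 64, 6
blocks = [np.eye(DIM) + 0.1 * rng.standard_normal((DIM, DIM))
          for _ in range(N_BLOCKS)]

def student_forward(x):
    """Toy stand-in for the flow backbone: returns output and per-block features."""
    feats, h = [], x
    for W in blocks:
        h = np.tanh(h @ W)
        feats.append(h)
    return h, feats

def cosine_align(s, t, eps=1e-8):
    s = s / (np.linalg.norm(s, axis=-1, keepdims=True) + eps)
    t = t / (np.linalg.norm(t, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

x = rng.standard_normal((N_TOKENS, DIM))
target_velocity = rng.standard_normal((N_TOKENS, DIM))  # flow-matching regression target
teacher_feats = rng.standard_normal((N_TOKENS, DIM))    # frozen teacher tokens (fixed)

out, feats = student_forward(x)
flow_loss = float(np.mean((out - target_velocity) ** 2))

align_block = -2    # hypothetical: align a late block, per the paper's depth finding
lambda_align = 0.5  # hypothetical weight on the alignment term
total_loss = flow_loss + lambda_align * cosine_align(feats[align_block], teacher_feats)
```

Negative transfer, under this sketch, would show up as the alignment term pulling `total_loss` gradients in directions that slow or degrade the flow objective; the paper's claim is that with late-layer alignment it does the opposite.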

What would settle it

A controlled training run on one of the five assembly benchmarks in which adding the TORA alignment loss produces slower convergence or lower final accuracy than the baseline flow model without alignment.

Figures

Figures reproduced from arXiv: 2604.04050 by Marc Pollefeys, Nahyuk Lee, Sunghwan Hong, Zhiang Chen.

Figure 1. Multi-part assembly results across regimes.
Figure 2. Overview of the Topological Representation Alignment (TORA).
Figure 3. Conceptual illustration of alignment objectives.
Figure 4. Correlation analysis of teacher representations.
Figure 5. Impact of different teachers on distillation. Part Accuracy of TORA under the L_cos-dist and L_CKA objectives across various 3D foundation models as teachers on the Breaking Bad dataset [45]; the dashed line indicates the RPF baseline [48].
Figure 6. Spatial structure emerges in later layers.
Figure 7. Ablation study on alignment layer depth. Part Accuracy consistently improves when alignment is applied to later layers.
Figure 8. Qualitative comparison. While the baseline RPF often struggles with precise part positioning and fails to resolve complex inter-part relations, TORA consistently produces structurally coherent assemblies that closely match the ground truth.
Figure 9. Convergence comparison. Validation Part Accuracy of Ours (CKA) against the RPF baseline and other alignment strategies, Ours (NT-Xent) and Ours (Cos-dist), over training epochs across three datasets; the dashed horizontal line marks the peak accuracy of the baseline, and the annotated multipliers give the convergence speedup relative to the baseline reaching that peak.
read the original abstract

Flow-matching methods for 3D shape assembly learn point-wise velocity fields that transport parts toward assembled configurations, yet they receive no explicit guidance about which cross-part interactions should drive the motion. We introduce TORA, a topology-first representation alignment framework that distills relational structure from a frozen pretrained 3D encoder into the flow-matching backbone during training. We first realize this via simple instantiation, token-wise cosine matching, which injects the learned geometric descriptors from the teacher representation. We then extend to employ a Centered Kernel Alignment (CKA) loss to match the similarity structure between student and teacher representations for enhanced topological alignment. Through systematic probing of diverse 3D encoders, we show that geometry- and contact-centric teacher properties, not semantic classification ability, govern alignment effectiveness, and that alignment is most beneficial at later transformer layers where spatial structure naturally emerges. TORA introduces zero inference overhead while yielding two consistent benefits: faster convergence (up to 6.9$\times$) and improved accuracy in-distribution, along with greater robustness under domain shift. Experiments on five benchmarks spanning geometric, semantic, and inter-object assembly demonstrate state-of-the-art performance, with particularly pronounced gains in zero-shot transfer to unseen real-world and synthetic datasets. Project page: https://nahyuklee.github.io/tora.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 0 minor

Summary. The manuscript introduces TORA, a topology-first representation alignment framework for flow-matching models in 3D shape assembly. It distills relational geometric and contact-centric structure from a frozen pretrained 3D encoder into the flow backbone via token-wise cosine matching and Centered Kernel Alignment (CKA) loss, with alignment applied most effectively at later transformer layers. The method is claimed to deliver up to 6.9× faster convergence, higher in-distribution accuracy, improved robustness under domain shift, and state-of-the-art results on five benchmarks spanning geometric, semantic, and inter-object assembly tasks, all with zero inference overhead.

Significance. If the reported gains hold under rigorous verification, TORA would offer a practical, low-cost way to inject useful relational priors from pretrained geometry encoders into flow-matching pipelines, with particular value for zero-shot transfer in assembly tasks. The systematic encoder-probing experiments provide useful empirical guidance on which pretrained properties (geometry/contact vs. semantics) transfer effectively.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. We are encouraged that the topology-first alignment approach, its empirical guidance on encoder properties, and the reported gains in convergence and robustness are viewed as potentially valuable for flow-matching pipelines.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central claims rest on an empirical pipeline: a frozen external pretrained 3D encoder supplies relational targets (via token-wise cosine or CKA losses) that are injected into a flow-matching student during training; benefits are then measured on held-out benchmarks. No derivation, equation, or performance metric is defined in terms of itself or reduced to a fitted parameter that is later renamed a prediction. Alignment targets originate outside the model being trained, and reported gains (convergence speed, accuracy, zero-shot robustness) are independent observables rather than tautological consequences of the loss. Any self-citations are peripheral and non-load-bearing for the core argument.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that pretrained 3D encoders already encode useful cross-part relational structure for assembly; no new entities are introduced and no explicit free parameters beyond standard training losses are named.

axioms (1)
  • domain assumption: Frozen pretrained 3D encoders capture geometry- and contact-centric relational structure that is useful for guiding flow-matching assembly.
    The method selects and aligns to such encoders; effectiveness is claimed to depend on this property rather than semantic classification ability.

pith-pipeline@v0.9.0 · 5539 in / 1285 out tokens · 46272 ms · 2026-05-13T17:06:53.768947+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 4 internal anchors

  1. An, H., Kim, J.H., Park, S., Jung, J., Han, J., Hong, S., Kim, S.: Cross-view completion models are zero-shot correspondence estimators. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 1103–1115 (2025)
  2. Besl, P.J., McKay, N.D.: Method for registration of 3-D shapes. In: Sensor Fusion IV: Control Paradigms and Data Structures, vol. 1611, pp. 586–606. SPIE (1992)
  3. Chaudhuri, S., Kalogerakis, E., Guibas, L., Koltun, V.: Probabilistic reasoning for assembly-based 3D modeling. In: ACM SIGGRAPH 2011 Papers, pp. 1–10 (2011)
  4. Cho, S., Hong, S., Jeon, S., Lee, Y., Sohn, K., Kim, S.: CATs: Cost aggregation transformers for visual correspondence. Advances in Neural Information Processing Systems 34, 9011–9023 (2021)
  5. Cho, S., Hong, S., Kim, S.: CATs++: Boosting cost aggregation with convolutions and transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(6), 7174–7194 (2022)
  6. Cho, S., Shin, H., Hong, S., Arnab, A., Seo, P.H., Kim, S.: CAT-Seg: Cost aggregation for open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4113–4123 (2024)
  7. Cortes, C., Mohri, M., Rostamizadeh, A.: Algorithms for learning kernels based on centered alignment. The Journal of Machine Learning Research 13, 795–828 (2012)
  8. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  9. Falcon, W., The PyTorch Lightning team: PyTorch Lightning (2019), https://github.com/PyTorchLightning/pytorch-lightning
  10. Hadgi, S., Gong, B., Sundararaman, R., Pierson, E., Li, L., Wonka, P., Ovsjanikov, M.: PatchAlign3D: Local feature alignment for dense 3D shape understanding. arXiv preprint arXiv:2601.02457 (2026)
  11. Han, J., An, H., Jung, J., Narihira, T., Seo, J., Fukuda, K., Kim, C., Hong, S., Mitsufuji, Y., Kim, S.: D²USt3R: Enhancing 3D reconstruction with 4D pointmaps for dynamic scenes. arXiv preprint arXiv:2504.06264 (2025)
  12. Han, J., Hong, S., Jung, J., Jang, W., An, H., Wang, Q., Kim, S., Feng, C.: Emergent outlier view rejection in visual geometry grounded transformers. arXiv preprint arXiv:2512.04012 (2025)
  13. Harada, K., Nagata, K., Rojas, J., Ramirez-Alpizar, I.G., Wan, W., Onda, H., Tsuji, T.: Proposal of a shape adaptive gripper for robotic assembly tasks. Advanced Robotics 30(17-18), 1186–1198 (2016)
  14. Hodan, T., Michel, F., Brachmann, E., Kehl, W., GlentBuch, A., Kraft, D., Drost, B., Vidal, J., Ihrke, S., Zabulis, X., et al.: BOP: Benchmark for 6D object pose estimation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34 (2018)
  15. Hong, S., Cho, S., Kim, S., Lin, S.: Unifying feature and cost aggregation with transformers for semantic and visual correspondence. arXiv preprint arXiv:2403.11120 (2024)
  16. Hong, S., Cho, S., Nam, J., Lin, S., Kim, S.: Cost aggregation with 4D convolutional Swin transformer for few-shot segmentation. In: European Conference on Computer Vision, pp. 108–126. Springer (2022)
  17. Hong, S., Jung, J., Shin, H., Han, J., Yang, J., Luo, C., Kim, S.: PF3plat: Pose-free feed-forward 3D Gaussian splatting. arXiv preprint arXiv:2410.22128 (2024)
  18. Hong, S., Jung, J., Shin, H., Yang, J., Kim, S., Luo, C.: Unifying correspondence, pose and NeRF for generalized pose-free novel view synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20196–20206 (2024)
  19. Hong, S., Kim, S.: Deep matching prior: Test-time optimization for dense correspondence. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9907–9917 (2021)
  20. Hong, S., Nam, J., Cho, S., Hong, S., Jeon, S., Min, D., Kim, S.: Neural matching fields: Implicit representation of matching fields for visual correspondence. Advances in Neural Information Processing Systems 35, 13512–13526 (2022)
  21. Huang, J., Zhan, G., Fan, Q., Mo, K., Shao, L., Chen, B., Guibas, L., Dong, H.: Generative 3D part assembly via dynamic graph learning (2020)
  22. Huang, J., Kumar, S.R., Mitra, M., Zhu, W.J., Zabih, R.: Image indexing using color correlograms. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 762–768. IEEE (1997)
  23. Jocher, G., Stoken, A., Borovec, J., Changyu, L., Hogan, A., Chaurasia, A., Diaconu, L., Ingham, F., Colmagro, A., Ye, H., et al.: ultralytics/yolov5: v4.0 - nn.SiLU() activations, Weights & Biases logging, PyTorch Hub integration. Zenodo (2021)
  24. Jones, R.K., Barton, T., Xu, X., Wang, K., Jiang, E., Guerrero, P., Mitra, N.J., Ritchie, D.: ShapeAssembly: Learning to generate programs for 3D shape structure synthesis. ACM Transactions on Graphics (TOG) 39(6), 1–20 (2020)
  25. Kim, C., Shin, H., Hong, E., Yoon, H., Arnab, A., Seo, P.H., Hong, S., Kim, S.: Seg4Diff: Unveiling open-vocabulary segmentation in text-to-image diffusion transformers. arXiv preprint arXiv:2509.18096 (2025)
  26. Lamb, N., Palmer, C., Molloy, B., Banerjee, S., Banerjee, N.K.: Fantastic breaks: A dataset of paired 3D scans of real-world broken objects and their complete counterparts. In: CVPR (2023)
  27. Lee, J., Jung, J., Han, J., Narihira, T., Fukuda, K., Seo, J., Hong, S., Mitsufuji, Y., Kim, S.: 3D scene prompting for scene-consistent camera-controllable video generation. arXiv preprint arXiv:2510.14945 (2025)
  28. Lee, N., Min, J., Lee, J., Kim, S., Lee, K., Park, J., Cho, M.: 3D geometric shape assembly via efficient point cloud matching. In: Proceedings of the International Conference on Machine Learning (ICML) (2024)
  29. Lee, N., Min, J., Lee, J., Park, C., Cho, M.: Combinative matching for geometric shape assembly. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9540–9549 (2025)
  30. Leng, X., Singh, J., Hou, Y., Xing, Z., Xie, S., Zheng, L.: REPA-E: Unlocking VAE for end-to-end tuning of latent diffusion transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18262–18272 (2025)
  31. Li, S., Jiang, Z., Chen, G., Xu, C., Tan, S., Wang, X., Fang, I., Zyskowski, K., McPherron, S.P., Iovita, R., et al.: GARF: Learning generalizable 3D reassembly for real-world fractures. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5711–5721 (2025)
  32. Li, Y., Mo, K., Duan, Y., Wang, H., Zhang, J., Shao, L.: Category-level multi-part multi-joint 3D shape assembly. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3281–3291 (2024)
  33. Liu, M., Shi, R., Kuang, K., Zhu, Y., Li, X., Han, S., Cai, H., Porikli, F., Su, H.: OpenShape: Scaling up 3D shape representation towards open-world understanding. Advances in Neural Information Processing Systems 36, 44860–44879 (2023)
  34. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
  35. Lu, J., Liang, Y., Han, H., Hua, J., Jiang, J., Li, X., Huang, Q.: A survey on computational solutions for reconstructing complete objects by reassembling their fractured parts. In: Computer Graphics Forum, vol. 44, p. e70081. Wiley Online Library (2025)
  36. Lu, J., Sun, Y., Huang, Q.: Jigsaw: Learning to assemble multiple fractured objects (2023), https://openreview.net/forum?id=OwpaO4w6K7
  37. Ma, Z., Yue, Y., Gkioxari, G.: Find any part in 3D. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7818–7827 (2025)
  38. McBride, J.C., Kimia, B.B.: Archaeological fragment reconstruction using curve-matching. In: CVPRW (2003)
  39. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
  40. Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205 (2023)
  41. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems 30 (2017)
  42. Qi, Y., Ju, Y., Wei, T., Chu, C., Wong, L.L., Xu, H.: Two by two: Learning multi-task pairwise objects assembly for generalizable robot manipulation. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 17383–17393 (2025)
  43. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
  44. Rousseeuw, P.J.: Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65 (1987)
  45. Sellán, S., Chen, Y.C., Wu, Z., Garg, A., Jacobson, A.: Breaking Bad: A dataset for geometric fracture and reassembly. Advances in Neural Information Processing Systems 35, 38885–38898 (2022)
  46. Singh, J., Leng, X., Wu, Z., Zheng, L., Zhang, R., Shechtman, E., Xie, S.: What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794 (2025)
  47. Son, K., Almeida, E.B., Cooper, D.B.: Axially symmetric 3D pots configuration system using axis of symmetry and break curve. In: CVPR (2013)
  48. Sun, T., Zhu, L., Huang, S., Song, S., Armeni, I.: Rectified point flow: Generic point cloud pose estimation. arXiv preprint arXiv:2506.05282 (2025)
  49. Wang, Z., Chen, J., Furukawa, Y.: PuzzleFusion++: Auto-agglomerative 3D fracture assembly by denoise and verify. In: ICLR (2025)
  50. Wang, Z., Zhao, W., Zhou, Y., Li, Z., Liang, Z., Shi, M., Zhao, X., Zhou, P., Zhang, K., Wang, Z., et al.: REPA works until it doesn't: Early-stopped, holistic alignment supercharges diffusion training. arXiv preprint arXiv:2505.16792 (2025)
  51. Wu, G., Zhang, S., Shi, R., Gao, S., Chen, Z., Wang, L., Chen, Z., Gao, H., Tang, Y., Yang, J., et al.: Representation entanglement for generation: Training diffusion transformers is much easier than you think. arXiv preprint arXiv:2507.01467 (2025)
  52. Wu, H., Wu, D., He, T., Guo, J., Ye, Y., Duan, Y., Bian, J.: Geometry forcing: Marrying video diffusion and 3D representation for consistent world modeling. arXiv preprint arXiv:2507.07982 (2025)
  53. Wu, R., Tie, C., Du, Y., Zhao, Y., Dong, H.: Leveraging SE(3) equivariance for learning 3D geometric shape assembly. In: ICCV (2023)
  54. Wu, X., DeTone, D., Frost, D., Shen, T., Xie, C., Yang, N., Engel, J., Newcombe, R., Zhao, H., Straub, J.: Sonata: Self-supervised learning of reliable point representations. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 22193–22204 (2025)
  55. Wu, X., Jiang, L., Wang, P.S., Liu, Z., Liu, X., Qiao, Y., Ouyang, W., He, T., Zhao, H.: Point Transformer V3: Simpler, faster, stronger. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4840–4851 (2024)
  56. Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199 (2017)
  57. Xu, B., Zheng, S., Jin, Q.: SPAFormer: Sequential 3D part assembly with transformers. In: 2025 International Conference on 3D Vision (3DV), pp. 1317–1327. IEEE (2025)
  58. Yoo, S.J., Liu, S., Arshad, M.Z., Kim, J., Kim, Y.M., Aloimonos, Y., Fermuller, C., Joo, K., Kim, J., Hong, J.H.: Structure-from-Sherds++: Robust incremental 3D reassembly of axially symmetric pots from unordered and mixed fragment collections. arXiv preprint arXiv:2502.13986 (2025)
  59. Yoon, H., Jung, J., Kim, J., Choi, H., Shin, H., Lim, S., An, H., Kim, C., Han, J., Kim, D., et al.: Visual representation alignment for multimodal large language models. arXiv preprint arXiv:2509.07979 (2025)
  60. Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940 (2024)
  61. Yu, X., Tang, L., Rao, Y., Huang, T., Zhou, J., Lu, J.: Point-BERT: Pre-training 3D point cloud transformers with masked point modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19313–19322 (2022)
  62. Yue, Y., Robert, D., Wang, J., Hong, S., Wegner, J.D., Rupprecht, C., Schindler, K.: LitePT: Lighter yet stronger point transformer. arXiv preprint arXiv:2512.13689 (2025)
  63. Zakka, K., Zeng, A., Lee, J., Song, S.: Form2Fit: Learning shape priors for generalizable assembly from disassembly (2020)
  64. Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language-image pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11975–11986 (2023)
  65. Zhang, Y., Wu, X., Lao, Y., Wang, C., Tian, Z., Wang, N., Zhao, H.: Concerto: Joint 2D-3D self-supervised learning emerges spatial representations. arXiv preprint arXiv:2510.23607 (2025)
  66. Zhou, J., Wang, J., Ma, B., Liu, Y.S., Huang, T., Wang, X.: Uni3D: Exploring unified 3D representation at scale. arXiv preprint arXiv:2310.06773 (2023)