arxiv: 2607.01851 · v1 · pith:VWIC6CW5new · submitted 2026-07-02 · 💻 cs.CV

Geometric Foundation Model Distillation for Efficient Lunar 3D Reconstruction

Pith reviewed 2026-07-03 16:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords knowledge distillation · 3D reconstruction · model compression · lunar imagery · stereo reconstruction · foundation models · geometric prediction

The pith

Distilling a 688M-parameter 3D model produces students up to seven times smaller that retain lunar stereo accuracy and beat sparse baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large foundation models achieve strong 3D reconstruction from stereo images but exceed the compute budgets available on planetary spacecraft. This paper shows that knowledge distillation, aided by an SVD-based projection of teacher weights into the student decoder, can compress such a model while preserving most of its geometric output quality on lunar imagery. The compressed students also exceed the performance of models trained directly on sparse ground-truth labels. The work identifies concrete design rules that emerged from the experiments, including the relative importance of encoder capacity and the advantage of feature-level over output-level supervision.

Core claim

Starting from a 688M-parameter MASt3R teacher fine-tuned on lunar imagery, distilling its dense geometric predictions into a family of lightweight students via structured SVD-based initialization yields models up to seven times smaller that retain most of the teacher's reconstruction accuracy and outperform baselines trained with sparse ground-truth annotations.
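
The "up to seven times smaller" figure can be checked against the parameter counts quoted in the Figure 1 caption (teacher 688.6M, S2 154.9M, S3 94.9M). The short sketch below only redoes that arithmetic; the function and variable names are illustrative and do not come from the paper.

```python
# Parameter counts in millions, as quoted in the Figure 1 caption.
TEACHER_M = 688.6
STUDENTS_M = {"S2": 154.9, "S3": 94.9}


def compression_ratio(teacher_m: float, student_m: float) -> float:
    """Return how many times smaller the student is than the teacher."""
    return teacher_m / student_m


# Round to one decimal place, matching the precision used in the caption.
ratios = {
    name: round(compression_ratio(TEACHER_M, size_m), 1)
    for name, size_m in STUDENTS_M.items()
}
print(ratios)  # {'S2': 4.4, 'S3': 7.3}
```

The smallest student quoted in the caption, S3, is therefore about 7.3 times smaller than the teacher, which matches the rounded claim.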

What carries the argument

structured SVD-based initialization that projects the teacher's decoder weights into the student's smaller latent space to provide a warm start for distillation
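
The page does not spell out the exact construction, so the following is only a minimal sketch of one way an SVD-based projection into a smaller latent space could look. It assumes square teacher weights that act on a shared latent space, builds an orthonormal basis from the SVD of the stacked weights, and expresses each weight in that basis. The helper names `shared_basis` and `project_weight`, the stacking choice, and the toy dimensions are all assumptions made for illustration, not the authors' method.

```python
import numpy as np


def shared_basis(teacher_weights, k):
    """Orthonormal basis (d_t x k) for the dominant subspace of the stacked weights."""
    stacked = np.concatenate(teacher_weights, axis=1)  # shape (d_t, n * d_t)
    u, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return u[:, :k]


def project_weight(w_teacher, basis):
    """Express a (d_t x d_t) teacher weight in the k-dimensional student space."""
    return basis.T @ w_teacher @ basis  # shape (k, k)


rng = np.random.default_rng(0)
d_t, k = 64, 16  # hypothetical teacher and student latent widths

# Toy teacher layers that live exactly in a k-dimensional subspace,
# so the projection below should lose nothing.
p0, _ = np.linalg.qr(rng.standard_normal((d_t, k)))
teacher = [p0 @ rng.standard_normal((k, k)) @ p0.T for _ in range(3)]

basis = shared_basis(teacher, k)
student = [project_weight(w, basis) for w in teacher]

# Lift the student weights back to the teacher space and measure the error.
recon_err = max(
    np.linalg.norm(basis @ s @ basis.T - w) for s, w in zip(student, teacher)
)
print(student[0].shape, recon_err < 1e-8)
```

For real teacher weights the subspace is not exactly low rank, so the projected weights are only an approximation. That is consistent with the page describing the result as a warm start for distillation rather than a finished student.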

If this is right

  • Feature-level distillation consistently outperforms output-only supervision.
  • Preserving encoder capacity matters more than maintaining a large decoder.
  • A convolutional encoder underperforms transformer-based alternatives under the same distillation regime.
  • SVD-based initialization improves optimisation stability and final performance.
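
To make the distinction between feature-level and output-only supervision concrete, here is a small hedged sketch of a combined distillation objective. The specific loss forms (mean absolute error on outputs, mean squared error on features), the linear `projector` that lifts student features to the teacher width, and the `feat_weight` parameter are assumptions chosen for illustration; the page does not state which losses the paper uses.

```python
import numpy as np


def output_loss(student_out, teacher_out):
    """Output-level term: mean absolute difference between dense predictions."""
    return float(np.mean(np.abs(student_out - teacher_out)))


def feature_loss(student_feat, teacher_feat, projector):
    """Feature-level term: MSE after lifting student features to the teacher width."""
    lifted = student_feat @ projector  # (tokens, d_s) @ (d_s, d_t)
    return float(np.mean((lifted - teacher_feat) ** 2))


def distillation_loss(student_out, teacher_out, student_feat, teacher_feat,
                      projector, feat_weight=1.0):
    """Output term plus weighted feature term; feat_weight=0 is output-only."""
    return (output_loss(student_out, teacher_out)
            + feat_weight * feature_loss(student_feat, teacher_feat, projector))


# Tiny hand-checkable example (all shapes are hypothetical).
teacher_out = np.zeros((2, 2, 3))   # e.g. a 2x2 map of 3D points
student_out = np.ones((2, 2, 3))    # off by 1 everywhere -> output loss 1.0
teacher_feat = np.zeros((4, 3))     # 4 tokens, teacher width 3
student_feat = np.ones((4, 2))      # 4 tokens, student width 2
projector = np.ones((2, 3))         # lifted features are all 2 -> feature loss 4.0

output_only = distillation_loss(student_out, teacher_out, student_feat,
                                teacher_feat, projector, feat_weight=0.0)
with_features = distillation_loss(student_out, teacher_out, student_feat,
                                  teacher_feat, projector, feat_weight=0.5)
print(output_only, with_features)  # 1.0 3.0
```

Setting `feat_weight` to zero recovers output-only supervision, so the two regimes compared in the first bullet differ only in whether the intermediate-feature term contributes to the gradient.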

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation recipe could be tested on other resource-limited stereo tasks such as drone navigation or satellite mapping.
  • Real hardware profiling on target flight processors would reveal whether the reported size reductions translate to usable frame rates in orbit.
  • The observed priority of encoder capacity suggests future geometric foundation models should allocate parameters differently when compression is anticipated.

Load-bearing premise

The lunar imagery dataset used for fine-tuning and evaluation is representative enough that the reported accuracy retention will hold under actual planetary deployment hardware constraints.

What would settle it

A student model produced by the distillation procedure shows substantially lower reconstruction accuracy than the teacher on a new set of lunar stereo pairs never seen during training or evaluation.

Figures

Figures reproduced from arXiv: 2607.01851 by Clémentine Grethen, Florient Chouteau, Géraldine Morin, Simone Gasparini.

Figure 1
Figure 1: Our student models, distilled from the teacher (MASt3R, 688.6M parameters), recover depth maps (top) and surface slopes (bottom) for view 1 of two uncalibrated lunar stereo pairs. Students are ordered left to right from best to weakest: S2 (ViT-Small, 154.9M, 4.4× smaller) closely matches the teacher in both depth and slope; S3 (reduced decoder, 94.9M, 7.3× smaller) remains competitive; S4 (V…
Figure 2
Figure 2: Teacher-student distillation framework. The teacher supervises the student via feature and output losses. (a) Encoder candidates. (b) Two compact decoder configurations (dimension/depth/attention heads). … illumination changes [7]. Classical photometric methods and SfM pipelines often degrade in this setting because correspondences are hard to establish, and descent trajectories may provide limited parallax…
Figure 3
Figure 3: Central profile results for a lunar stereo pair and corresponding 3D scene. Comparison between S2 (best student), S1 (weaker student), and the teacher. S1 exhibits poor cross-view alignment and a central profile poorly correlated with the ground-truth terrain, while S2 produces results closely matching the teacher. … assess terrain safety, critical for lunar landing, we compute the Slope Mean Absolute Error (…
Figure 4
Figure 4: Robustness to lighting variations on a challenging lunar crater scene. The same scene is evaluated, using the S2 student, under three different illumination conditions from the test set. The first column shows the input stereo pairs under the three lighting setups, the second column shows the ground-truth (GT) depth maps, the third column shows the predicted depth maps, and the fourth column shows the central …
Figure 5
Figure 5: Qualitative ablation results on one stereo pair with a challenging camera baseline for the S2 student. Rows 1–2 show the predicted surface slope maps for each view, while Row 3 shows the reconstructed 3D scene (colors indicate depth, with blue corresponding to closer points and red to farther points). Colored boxes highlight regions of interest discussed in the text. Exp. A corresponds to the full model,…
read the original abstract

Large 3D foundation models such as MASt3R achieve state-of-the-art stereo reconstruction but are computationally demanding for deployment under strict hardware constraints, a critical limitation in domains such as planetary exploration, where onboard computing is severely restricted. We study how far such models can be compressed through knowledge distillation, using lunar stereo reconstruction as a challenging and practically relevant case study. Starting from a 688M-parameter MASt3R teacher fine-tuned on lunar imagery, we distill its dense geometric predictions into a family of lightweight students spanning different encoder types (CNN vs ViT), decoder widths and depths, and training strategies. To bridge the dimensional mismatch between teacher and student, we propose a structured SVD-based initialization that projects the teacher's decoder weights into the student's smaller latent space, yielding a warm start that significantly improves convergence and final performance. Based on our results on lunar data, we can obtain a distilled student that retains most of the teacher's reconstruction accuracy while reducing the model size by up to 7 times, and even outperforms a baseline trained directly with sparse ground-truth annotations. Beyond compression, our study highlights both principles and practical insights for distilling geometric foundation models: a convolutional encoder underperforms transformer-based alternatives (though pretraining availability remains a confounding factor), preserving encoder capacity is more critical than maintaining a large decoder, feature-level distillation consistently outperforms output-only supervision, and SVD-based initialization improves optimisation stability. These findings provide practical guidelines for deploying 3D reconstruction models in resource-constrained environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper studies knowledge distillation to compress the 688M-parameter MASt3R teacher (fine-tuned on lunar imagery) into smaller student models for stereo 3D reconstruction. It introduces an SVD-based decoder initialization to address latent-space mismatch, compares CNN vs. ViT encoders, feature-level vs. output-level distillation, and varying decoder widths/depths. On the lunar dataset the best students retain most of the teacher's accuracy at up to 7× smaller size and outperform a sparse ground-truth baseline; the study also reports practical guidelines (encoder capacity matters more than decoder size, feature distillation is preferable, SVD warm-start aids convergence).

Significance. If the reported retention of accuracy and outperformance of the sparse baseline hold under the stated protocol, the work supplies concrete, actionable guidance for deploying geometric foundation models under the severe compute limits of planetary missions. The SVD initialization technique and the controlled architecture/distillation ablations are reusable contributions beyond the lunar case. The manuscript does not claim parameter-free derivations or machine-checked proofs, but the explicit baseline comparisons and architecture sweeps strengthen its empirical value.

minor comments (3)
  1. [Abstract / Results] Abstract and results paragraph state quantitative gains and the 7× compression claim but do not report exact metrics, error bars, or data-split details; these should be added to the main results section (or a table) so readers can verify the retention and outperformance statements.
  2. [Results] The claim that the distilled student 'outperforms a baseline trained directly with sparse ground-truth annotations' is central; the manuscript should clarify whether the sparse baseline used identical training data volume, augmentation, and optimization budget as the distillation runs.
  3. [Discussion] The discussion of CNN vs. ViT encoders notes pretraining availability as a confounding factor; a short additional experiment or explicit statement on whether both encoder families started from the same pretraining regime would remove ambiguity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment, detailed summary, and recommendation for minor revision. The report contains no enumerated major comments requiring point-by-point rebuttal.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is an empirical distillation study that fine-tunes a MASt3R teacher on lunar imagery then trains a family of smaller student models using explicit protocols (SVD-based decoder initialization, feature-level vs. output-level distillation, CNN vs. ViT encoders) and reports accuracy/size results against concrete baselines. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain; all central claims rest on experimental comparisons on the chosen dataset rather than reducing to inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard machine-learning assumptions about the transferability of dense predictions and the utility of the provided lunar imagery; the SVD initialization is presented as an engineering contribution rather than a new postulated entity.

free parameters (1)
  • student encoder/decoder widths and depths
    Family of lightweight students with varying dimensions is explored; specific widths are chosen and tested rather than derived.
axioms (1)
  • domain assumption: The teacher model's dense geometric predictions on lunar imagery constitute high-quality supervision suitable for distillation
    Invoked throughout the distillation setup described in the abstract.

Reference graph

Works this paper leans on

36 extracted references · 16 canonical work pages · 6 internal anchors

  1. [1]

    In: ECCV (2022)

    Arnold, E., Wynn, J., Vicente, S., Garcia-Hernando, G., Monszpart, Á., Prisacariu, V.A., Turmukhambetov, D., Brachmann, E.: Map-free visual relocalization: Metric pose relative to a single image. In: ECCV (2022)

  2. [2]

    In: European Conference on Computer Vision

    Baradel, F., Armando, M., Galaaoui, S., Brégier, R., Weinzaepfel, P., Rogez, G., Lucas, T.: Multi-hmr: Multi-person whole-body human mesh recovery in a single shot. In: European Conference on Computer Vision. pp. 202–218. Springer (2024)

  3. [3]

    International Journal of Computer Vision 35(1), 33–44 (Nov 1999). https://doi.org/10.1023/a:1008154927611

    Belhumeur, P.N., Kriegman, D.J., Yuille, A.L.: The bas-relief ambiguity. International Journal of Computer Vision 35(1), 33–44 (Nov 1999). https://doi.org/10.1023/a:1008154927611

  4. [4]

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale (2021), https://arxiv.org/abs/2010.11929

  5. [5]

    Dutt, A., Lunawat, I., Kaur, M.: Multi-view 3d reconstruction using knowledge distillation (2026), https://arxiv.org/abs/2412.02039

  6. [6]

    Psychometrika 1(3), 211–218 (1936)

    Eckart, C., Young, G.: The approximation of one matrix by another of lower rank. Psychometrika 1(3), 211–218 (1936)

  7. [7]

    Getchius, J., Renshaw, D., Posada, D., Henderson, T., Hong, L., Ge, S., Molina, G.: Hazard Detection and Avoidance for the Nova-C Lander, p. 921–943. Springer International Publishing (2024). https://doi.org/10.1007/978-3-031-51928-4_53

  8. [8]

    International Journal of Computer Vision 129(6), 1789–1819 (Mar 2021). https://doi.org/10.1007/s11263-021-01453-z

    Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: A survey. International Journal of Computer Vision 129(6), 1789–1819 (Mar 2021). https://doi.org/10.1007/s11263-021-01453-z

  9. [9]

    In: ICCVW

    Grethen, C., Gasparini, S., Morin, G., Lebreton, J., Marti, L., Sanchez-Gestido, M.: Adapting stereo vision from objects to 3d lunar surface reconstruction with the stereolunar dataset. In: ICCVW. pp. 3751–3760 (2025)

  10. [10]

    Guo, M., Cao, M., Tao, J., Xu, R., Yan, Y., Liang, X., Laptev, I., Chang, X.: Glad: Geometric latent distillation for vision-language-action models (2025), https://arxiv.org/abs/2512.09619

  11. [11]

    Cambridge University Press, ISBN: 0521540518, second edn

    Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edn. (2004)

  12. [12]

    Distilling the Knowledge in a Neural Network

    Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

  13. [13]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019)

    Howard, A., Sandler, M., Chu, G., Chen, L.C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., Le, Q.V., Adam, H.: Searching for mobilenetv3. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019)

  14. [14]

    ICLR 1(2), 3 (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)

  15. [15]

    Kumar, A., Kaushal, S., Murthy, S.V.: Moonmetasync: Lunar image registration analysis (2024), https://arxiv.org/abs/2410.11118

  16. [16]

    Leblanc, B., Poullis, C.: Distill3r: A pipeline for democratizing 3d foundation models on commodity hardware (2026)

  17. [17]

    In: ECCV (2024)

    Leroy, V., Cabon, Y., Revaud, J.: Grounding image matching in 3d with MASt3R. In: ECCV (2024)

  18. [18]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  19. [19]

    In: International Conference on Learning Representations (2015)

    Romero, A., Ballas, N., Kahou, S., Chassang, A., Gatta, C., Bengio, Y.: Fitnets: Hints for thin deep nets. In: International Conference on Learning Representations (2015)

  20. [20]

    Sanh, V., Debut, L., Chaumond, J., Wolf, T.: Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter (2020), https://arxiv.org/abs/1910.01108

  21. [21]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Sarıyıldız, M.B., Weinzaepfel, P., Lucas, T., De Jorge, P., Larlus, D., Kalantidis, Y.: Dune: Distilling a universal encoder from heterogeneous 2d and 3d teachers. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 30084–30094 (2025)

  22. [22]

    In: AIAA SCITECH 2023 Forum

    Steffes, S.R., DeTrempe, P., Barton, G., Woffinden, D.: Hazard boresight relative navigation for safe lunar landing. In: AIAA SCITECH 2023 Forum. American Institute of Aeronautics and Astronautics (Jan 2023). https://doi.org/10.2514/6.2023-0691

  23. [23]

    arXiv preprint arXiv:2412.16719 (2024)

    Sy, Y., Cerisara, C., Illina, I.: Lillama: Large language models compression via low-rank feature distillation. arXiv preprint arXiv:2412.16719 (2024)

  24. [24]

    In: International conference on machine learning

    Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International conference on machine learning. pp. 10347–10357. PMLR (2021)

  25. [25]

    Proceedings of the Royal Society of London

    Ullman, S.: The interpretation of structure from motion. Proceedings of the Royal Society of London. Series B. Biological Sciences 203(1153), 405–426 (Jan 1979). https://doi.org/10.1098/rspb.1979.0006

  26. [26]

    Advances in neural information processing systems 30 (2017)

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)

  27. [27]

    Vuong, A.D., Vu, M.N., Reid, I.: Improving robotic manipulation with efficient geometry-aware vision encoder (2025), https://arxiv.org/abs/2509.15880

  28. [28]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

  29. [29]

    arXiv preprint arXiv:2310.02328 (2023)

    Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: Dust3r: Geometric 3d vision made easy. arXiv preprint arXiv:2310.02328 (2023)

  30. [30]

    Advances in Neural Information Processing Systems 35, 3502–3516 (2022)

    Weinzaepfel, P., Leroy, V., Lucas, T., Brégier, R., Cabon, Y., Arora, V., Antsfeld, L., Chidlovskii, B., Csurka, G., Revaud, J.: Croco: Self-supervised pre-training for 3d vision tasks by cross-view completion. Advances in Neural Information Processing Systems 35, 3502–3516 (2022)

  31. [31]

    The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLI-B4, 521–527 (Jun 2016). https://doi.org/10

    Wu, B., Liu, W.C., Grumpe, A., Wöhler, C.: Shape and albedo from shading (SAfS) for pixel-level dem generation from monocular images constrained by low-resolution dem. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLI-B4, 521–527 (Jun 2016). https://doi.org/10.5194/isprs-archives-xli-b4-521-2016

  32. [32]

    Wu, K., Zhang, J., Peng, H., Liu, M., Xiao, B., Fu, J., Yuan, L.: Tinyvit: Fast pretraining distillation for small vision transformers. In: ECCV. pp. 68–85. Springer (2022)

  33. [33]

    A Survey on Knowledge Distillation of Large Language Models

    Xu, X., Li, M., Tao, C., Shen, T., Cheng, R., Li, J., Xu, C., Tao, D., Zhou, T.: A survey on knowledge distillation of large language models (2024), https://arxiv.org/abs/2402.13116

  34. [34]

    In: CVPR (2024)

    Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth anything: Unleashing the power of large-scale unlabeled data. In: CVPR (2024)

  35. [35]

    Yu, Z., Wen, Y., Mou, L.: Revisiting intermediate-layer matching in knowledge distillation: Layer-selection strategy doesn’t matter (much). In: Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics. pp. 1686–1694 (2025)

  36. [36]

    Streaming 4D Visual Geometry Transformer

    Zhuo, D., Zheng, W., Guo, J., Wu, Y., Zhou, J., Lu, J.: Streaming 4d visual geometry transformer. arXiv preprint arXiv:2507.11539 (2025)