Free Geometry: Refining 3D Reconstruction from Longer Versions of Itself
Pith reviewed 2026-05-10 12:53 UTC · model grok-4.3
The pith
Feed-forward 3D reconstruction models can refine their own outputs at test time by enforcing consistency between full sequences and masked-frame subsets without any ground truth.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Free Geometry constructs a self-supervised task from a testing sequence by masking a subset of frames, then enforces cross-view feature consistency between the representations produced from the full observation and the partial observation while also maintaining the pairwise relations implied by the held-out frames; these signals drive fast LoRA-based recalibration that improves the base model's accuracy on the same scene.
What carries the argument
The masked-frame consistency task that compares full-sequence and partial-sequence representations while preserving implied pairwise geometry, used to generate a self-supervised training signal for LoRA updates.
If this is right
- Camera pose accuracy rises by an average of 3.73 percent across four benchmark datasets.
- Point map prediction accuracy rises by an average of 2.88 percent on the same datasets.
- The same procedure works on top of existing foundation models including Depth Anything 3 and VGGT.
- Adaptation completes in less than two minutes per dataset on a single GPU.
- The gains appear in scenes containing occlusions, specular surfaces, and ambiguous visual cues.
Where Pith is reading between the lines
- The same masking-and-consistency principle might be tested on other feed-forward geometric tasks such as surface normal estimation or novel-view synthesis.
- If longer sequences continue to supply stronger signals, the method could be iterated multiple times on a single scene to produce further incremental gains.
- The approach suggests a general route for turning extra test-time observations into supervision for any model whose output quality scales with input length.
Load-bearing premise
More input views always produce more reliable and view-consistent reconstructions than fewer views, allowing masked subsets to serve as a trustworthy self-supervised signal.
What would settle it
Applying the masking-and-consistency procedure to a new test sequence and observing no improvement or a drop in camera-pose or point-map accuracy on held-out frames would show the self-supervision signal is not reliable.
Figures
read the original abstract
Feed-forward 3D reconstruction models are efficient but rigid: once trained, they perform inference in a zero-shot manner and cannot adapt to the test scene. As a result, visually plausible reconstructions often contain errors, particularly under occlusions, specularities, and ambiguous cues. To address this, we introduce Free Geometry, a framework that enables feed-forward 3D reconstruction models to self-evolve at test time without any 3D ground truth. Our key insight is that, when the model receives more views, it produces more reliable and view-consistent reconstructions. Leveraging this property, given a testing sequence, we mask a subset of frames to construct a self-supervised task. Free Geometry enforces cross-view feature consistency between representations from full and partial observations, while maintaining the pairwise relations implied by the held-out frames. This self-supervision allows for fast recalibration via lightweight LoRA updates, taking less than 2 minutes per dataset on a single GPU. Our approach consistently improves state-of-the-art foundation models, including Depth Anything 3 and VGGT, across 4 benchmark datasets, yielding an average improvement of 3.73% in camera pose accuracy and 2.88% in point map prediction. Code is available at https://github.com/hiteacherIamhumble/Free-Geometry .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Free Geometry, a test-time adaptation framework for feed-forward 3D reconstruction models. Given a test sequence, it masks a subset of frames to create a self-supervised task that enforces cross-view feature consistency between full-sequence and partial reconstructions while preserving pairwise relations from the held-out frames. Lightweight LoRA updates are then applied to refine models such as Depth Anything 3 and VGGT. The paper reports average improvements of 3.73% in camera pose accuracy and 2.88% in point map prediction across four benchmark datasets, with the process taking less than 2 minutes per dataset on a single GPU and no 3D ground truth required.
Significance. If the central claims hold, the work would provide a practical, efficient mechanism for adapting rigid zero-shot 3D foundation models to individual test scenes via internal consistency signals. This addresses a key limitation of current feed-forward approaches in handling ambiguities like occlusions and specularities. The reported gains on standard benchmarks and the emphasis on reproducibility (code release) would make it a useful contribution to test-time adaptation in 3D vision, provided the self-supervision mechanism is shown to be robust rather than merely self-reinforcing.
major comments (2)
- [Method] Method section (description of the self-supervised consistency loss): The framework rests on the unvalidated premise that reconstructions from the full sequence are reliably more accurate and view-consistent than those from masked subsets, allowing the former to serve as pseudo-targets. No analysis, failure-case experiments, or quantitative comparison to ground truth is provided to show when this holds (e.g., under persistent ambiguities such as textureless regions or specularities). If the premise fails, the loss simply aligns the model to its own errors, directly undermining the claimed improvements.
- [Experiments] Experiments section (quantitative results and ablations): The reported average gains of 3.73% pose and 2.88% point-map accuracy are presented without ablation studies isolating the contributions of cross-view feature consistency versus pairwise relation preservation, without details on masking ratios or LoRA hyperparameters, and without analysis of variance across sequences. This makes it impossible to determine whether the gains are robust or sensitive to the specific self-supervision construction.
minor comments (2)
- [Abstract and Method] The abstract and method descriptions would benefit from explicit notation for the masking operation and the exact form of the consistency loss (e.g., whether it is L2 on features or a different metric).
- [Figures] Figure captions and the framework diagram should more clearly distinguish the full-sequence path from the masked-subset path to aid reader comprehension.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our manuscript. We address each major point below and outline the revisions we will make to strengthen the paper.
read point-by-point responses
-
Referee: [Method] Method section (description of the self-supervised consistency loss): The framework rests on the unvalidated premise that reconstructions from the full sequence are reliably more accurate and view-consistent than those from masked subsets, allowing the former to serve as pseudo-targets. No analysis, failure-case experiments, or quantitative comparison to ground truth is provided to show when this holds (e.g., under persistent ambiguities such as textureless regions or specularities). If the premise fails, the loss simply aligns the model to its own errors, directly undermining the claimed improvements.
Authors: We acknowledge that the original manuscript does not provide direct quantitative comparisons to ground truth or failure-case analyses specifically validating that full-sequence reconstructions are superior to masked ones. The reported improvements on standard benchmarks provide indirect evidence of the method's effectiveness. To address this concern rigorously, we will add in the revised manuscript: (1) quantitative comparisons of full vs. masked reconstruction accuracy against ground truth on a subset of sequences, (2) failure case studies highlighting scenarios with textureless regions and specularities, and (3) discussion of conditions under which the premise holds. This will clarify the robustness of the self-supervision signal. revision: yes
-
Referee: [Experiments] Experiments section (quantitative results and ablations): The reported average gains of 3.73% pose and 2.88% point-map accuracy are presented without ablation studies isolating the contributions of cross-view feature consistency versus pairwise relation preservation, without details on masking ratios or LoRA hyperparameters, and without analysis of variance across sequences. This makes it impossible to determine whether the gains are robust or sensitive to the specific self-supervision construction.
Authors: We agree that additional details and ablations are necessary to demonstrate the robustness of the results. The original submission focused on overall performance but omitted component-wise ablations, specific hyperparameter values, and per-sequence variance. In the revision, we will include: ablations separating the effects of cross-view consistency and pairwise preservation, tables detailing masking ratios (e.g., 20-50%) and LoRA configurations (rank, alpha), and standard deviation or per-dataset variance analysis for the reported metrics. These additions will allow readers to assess sensitivity and reproducibility. revision: yes
Circularity Check
No significant circularity; self-supervision uses held-out frames with external benchmark validation
full rationale
The paper's core mechanism generates a self-supervised consistency loss by comparing the model's output on a full test sequence against its output on a masked subset of the same sequence, then applies LoRA updates. This does not reduce to a tautology by construction because the full-sequence output is not mathematically forced to equal the masked output; the loss is minimized through parameter updates whose effect is measured on independent ground-truth benchmarks (camera pose and point map accuracy). No equations are presented that equate the target to the input by definition, no parameters are fitted on a subset and then renamed as a prediction of the same quantity, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The derivation therefore remains self-contained against external evaluation rather than self-referential.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption When the model receives more views, it produces more reliable and view-consistent reconstructions
Reference graph
Works this paper leans on
-
[1]
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: ICCV (2021)
work page 2021
-
[2]
In: International Conference on Learning Representations (2021)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2021)
work page 2021
-
[3]
Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P.H., Buchatskaya, E., Do- ersch, C., Pires, B.A., Guo, Z.D., Azar, M.G., Piot, B., Kavukcuoglu, K., Munos, R., Valko, M.: Bootstrap your own latent: A new approach to self-supervised learn- ing. In: NeurIPS (2020)
work page 2020
-
[4]
Distilling the Knowledge in a Neural Network
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[5]
In: International Conference on Learning Representations (2022)
Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022)
work page 2022
-
[6]
In: The Fourteenth International Conference on Learning Representations (2026)
Lin, H., Chen, S., Liew, J.H., Chen, D.Y., Li, Z., Zhao, Y., Peng, S., Guo, H., Zhou, X., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. In: The Fourteenth International Conference on Learning Representations (2026)
work page 2026
-
[7]
(eds.) Advances in Neural Infor- mation Processing Systems (2021)
Liu, Y., Kothari, P., van Delft, B.G., Bellot-Gurlet, B., Mordan, T., Alahi, A.: TTT++: When does self-supervised test-time training fail or thrive? In: Beygelz- imer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Infor- mation Processing Systems (2021)
work page 2021
-
[8]
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
work page 2019
-
[9]
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. TMLR (2024)
work page 2024
-
[10]
Park, W., Kim, D., Lu, Y., Cho, M.: Relational knowledge distillation. In: CVPR (2019)
work page 2019
-
[11]
Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: FitNets: Hints for thin deep nets. In: ICLR (2015)
work page 2015
-
[12]
In: Conference on Computer Vision and Pattern Recognition (2016)
Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Conference on Computer Vision and Pattern Recognition (2016)
work page 2016
-
[13]
Schöps, T., Schönberger, J.L., Galliani, S., Sattler, T., Schindler, K., Pollefeys, M., Geiger, A.: A multi-view stereo benchmark with high-resolution images and multi-camera videos. In: CVPR (2017)
work page 2017
-
[14]
Shotton, J., Glocker, B., Zach, C., Izadi, S., Criminisi, A., Fitzgibbon, A.: Scene coordinate regression forests for camera relocalization in RGB-D images. In: CVPR (2013)
work page 2013
-
[15]
Sun,Y.,Wang,X.,Zhuang,L.,Miller,J.,Hardt,M.,Efros,A.A.:Test-timetraining with self-supervision for generalization under distribution shifts. In: ICML (2020)
work page 2020
-
[16]
IEEE Transactions on Pattern Analysis and Machine Intelligence 13(4), 376–380 (1991)
Umeyama, S.: Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(4), 376–380 (1991)
work page 1991
- [17]
-
[18]
In: Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (2025)
Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (2025)
work page 2025
-
[19]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: Dust3r: Geometric 3d vision made easy. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
work page 2024
-
[20]
European Conference on Computer Vision (2018)
Yao, Y., Luo, Z., Li, S., Fang, T., Quan, L.: Mvsnet: Depth inference for unstruc- tured multi-view stereo. European Conference on Computer Vision (2018)
work page 2018
-
[21]
Yeshwanth,C.,Liu,Y.C.,Nießner,M.,Dai,A.:ScanNet++:Ahigh-fidelitydataset of 3d indoor scenes. In: ICCV (2023)
work page 2023
-
[22]
In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
Yuan, Y., Shen, Q., Wang, S., Yang, X., Wang, X.: Test3r: Learning to reconstruct 3d at test time. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
work page 2025
-
[23]
Zhang, M., Levine, S., Finn, C.: MEMO: Test time robustness via adaptation and augmentation. In: NeurIPS (2022) Free Geometry 1 Supplementary Material 1 Method Details 1.1 Free Geometry Self-Supervised Geometric Losses Free Geometry performs test-time adaptation through a self-supervised geo- metric objective defined between two branches of the same scene...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.