
arxiv: 2604.14048 · v1 · submitted 2026-04-15 · 💻 cs.CV

Free Geometry: Refining 3D Reconstruction from Longer Versions of Itself

Pith reviewed 2026-05-10 12:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D reconstruction · test-time adaptation · self-supervised learning · feed-forward models · camera pose · point maps · LoRA updates · multi-view consistency

The pith

Feed-forward 3D reconstruction models can refine their own outputs at test time by enforcing consistency between full sequences and masked-frame subsets without any ground truth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that lets otherwise rigid feed-forward 3D models adapt to individual test scenes by treating longer view sequences as a source of self-supervision. It rests on the observation that adding more input views yields reconstructions that are more reliable and view-consistent. The method therefore masks some frames, compares the model's representations from the complete and reduced inputs, and enforces matching features together with the pairwise relations implied by the held-out frames. Lightweight LoRA updates then recalibrate the model in under two minutes per dataset on a single GPU. The paper reports measurable gains in camera pose accuracy and point map quality on four benchmark datasets for models such as Depth Anything 3 and VGGT.
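For readers unfamiliar with LoRA [5], the recalibration step can be pictured with a minimal sketch of the low-rank update itself. The frozen weight W is left untouched and only two small factors B and A are trained, giving an effective weight W + (alpha / r) · B A. The shapes, values, and function names below are illustrative assumptions; which layers the paper adapts, and with what rank or alpha, is not stated on this page.

```python
# Minimal LoRA sketch in pure Python. All shapes and values are illustrative.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank.

    w: frozen base weight, shape (d_out, d_in)
    b: trainable factor,   shape (d_out, r)
    a: trainable factor,   shape (r, d_in)
    """
    r = len(a)                      # rank = number of rows of A
    scale = alpha / r
    delta = matmul(b, a)            # low-rank update, shape (d_out, d_in)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen weight (2 x 3)
a = [[0.1, 0.2, 0.3]]                    # rank-1 factor A (1 x 3)
b_zero = [[0.0], [0.0]]                  # B starts at zero, so the update is a no-op
b_trained = [[1.0], [2.0]]               # hypothetical values after a few steps

w_initial = lora_effective_weight(w, a, b_zero, alpha=2.0)
w_adapted = lora_effective_weight(w, a, b_trained, alpha=2.0)
```

Because B is initialized to zero, the adapted model starts out identical to the base model, and only the small factors need to be stored per scene or dataset.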

Core claim

Free Geometry constructs a self-supervised task from a testing sequence by masking a subset of frames, then enforces cross-view feature consistency between the representations produced from the full observation and the partial observation while also maintaining the pairwise relations implied by the held-out frames; these signals drive fast LoRA-based recalibration that improves the base model's accuracy on the same scene.
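The masking step and the full-versus-partial comparison can be sketched as follows. This is a toy illustration only: `toy_encoder` is a hypothetical stand-in for the real multi-view network, the squared-error form of the loss is an assumption (the page does not give the exact metric), and the 50% keep ratio follows the "half of views masked" example in the Figure 3 caption.

```python
import random

def toy_encoder(frames):
    """Toy stand-in for a multi-view model (an assumption, not the paper's network).

    Each output feature mixes a frame's own descriptor with the mean of all
    visible frames, so the result depends on which frames are visible.
    """
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[0.5 * f[d] + 0.5 * mean[d] for d in range(dim)] for f in frames]

def split_frames(num_frames, keep_ratio=0.5, seed=0):
    """Randomly choose which frame indices stay visible and which are held out."""
    rng = random.Random(seed)
    num_keep = max(1, int(num_frames * keep_ratio))
    keep = sorted(rng.sample(range(num_frames), num_keep))
    held_out = [i for i in range(num_frames) if i not in keep]
    return keep, held_out

def consistency_loss(frames, keep):
    """Mean squared gap between full-branch and partial-branch features on kept frames."""
    full = toy_encoder(frames)                        # sees every frame, used as target
    partial = toy_encoder([frames[i] for i in keep])  # sees only the kept frames
    dim = len(frames[0])
    gaps = [(full[i][d] - partial[j][d]) ** 2
            for j, i in enumerate(keep) for d in range(dim)]
    return sum(gaps) / len(gaps)

frames = [[float(i), float(i * i)] for i in range(8)]  # 8 views, made-up 2-D descriptors
keep, held_out = split_frames(len(frames))             # half of the views masked
loss_random = consistency_loss(frames, keep)
loss_no_mask = consistency_loss(frames, list(range(len(frames))))  # branches coincide
```

With nothing masked the two branches coincide and the loss is zero; masking opens a gap that a trainable partial branch would be asked to close.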

What carries the argument

The masked-frame consistency task that compares full-sequence and partial-sequence representations while preserving implied pairwise geometry, used to generate a self-supervised training signal for LoRA updates.
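One way to read "preserving implied pairwise geometry" is as a relational constraint in the spirit of relational knowledge distillation [10]: rather than matching features one by one, match the pattern of distances between them. The sketch below is an assumption about the general shape of such a term, not the paper's loss; the function names and the mean-normalized squared-error form are illustrative, and which frames enter the comparison is left open.

```python
import math
from itertools import combinations

def pairwise_distances(features):
    """Euclidean distance for every unordered pair of feature vectors."""
    return [math.dist(a, b) for a, b in combinations(features, 2)]

def normalize(distances):
    """Divide by the mean distance so the relation pattern is scale-free."""
    mean = sum(distances) / len(distances)
    return [d / mean for d in distances] if mean > 0 else list(distances)

def relation_loss(target_feats, student_feats):
    """Mean squared gap between the two normalized pairwise-distance patterns."""
    t = normalize(pairwise_distances(target_feats))
    s = normalize(pairwise_distances(student_feats))
    return sum((a - b) ** 2 for a, b in zip(t, s)) / len(t)

target = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]      # made-up features, full branch
same_shape = [[5.0, 5.0], [7.0, 5.0], [5.0, 7.0]]  # shifted and scaled copy
distorted = [[0.0, 0.0], [3.0, 0.0], [0.0, 1.0]]   # one frame pulled away

loss_same = relation_loss(target, same_shape)
loss_distorted = relation_loss(target, distorted)
```

A term of this kind ignores global shifts and rescaling of the feature space and penalizes only changes in how the frames relate to one another, which is why it complements a per-feature consistency term.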

If this is right

  • Camera pose accuracy rises by an average of 3.73 percent across four benchmark datasets.
  • Point map prediction accuracy rises by an average of 2.88 percent on the same datasets.
  • The same procedure works on top of existing foundation models including Depth Anything 3 and VGGT.
  • Adaptation completes in less than two minutes per dataset on a single GPU.
  • The gains appear in scenes containing occlusions, specular surfaces, and ambiguous visual cues.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same masking-and-consistency principle might be tested on other feed-forward geometric tasks such as surface normal estimation or novel-view synthesis.
  • If longer sequences continue to supply stronger signals, the method could be iterated multiple times on a single scene to produce further incremental gains.
  • The approach suggests a general route for turning extra test-time observations into supervision for any model whose output quality scales with input length.

Load-bearing premise

More input views always produce more reliable and view-consistent reconstructions than fewer views, allowing masked subsets to serve as a trustworthy self-supervised signal.
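Where ground truth is available, the premise can be probed directly by counting how often the full-sequence reconstruction really is better than the masked one. The following is a minimal sketch under that assumption; the function name and the error values are placeholders for illustration and are not results from the paper.

```python
def premise_win_rate(full_errors, masked_errors):
    """Fraction of sequences where the full-sequence error is strictly lower.

    Both lists hold one ground-truth error per sequence, in the same order.
    Also returns the indices where the premise fails, for failure-case study.
    """
    if len(full_errors) != len(masked_errors) or not full_errors:
        raise ValueError("need two non-empty lists of equal length")
    failures = [i for i, (f, m) in enumerate(zip(full_errors, masked_errors)) if f >= m]
    win_rate = 1.0 - len(failures) / len(full_errors)
    return win_rate, failures

# Placeholder numbers, invented purely to exercise the function.
full_errors = [0.10, 0.20, 0.30, 0.25]
masked_errors = [0.15, 0.18, 0.35, 0.40]
win_rate, failures = premise_win_rate(full_errors, masked_errors)
```

A win rate well below 1.0, or failures that cluster on textureless or specular scenes, would indicate that the full-sequence branch is not always a trustworthy target.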

What would settle it

Applying the masking-and-consistency procedure to a new test sequence and observing no improvement or a drop in camera-pose or point-map accuracy on held-out frames would show the self-supervision signal is not reliable.
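The test described above amounts to a paired before-and-after comparison on held-out frames. A minimal sketch, assuming one scalar error per sequence (lower is better); the function name, the tolerance parameter, and the example numbers are hypothetical and do not come from the paper.

```python
from statistics import mean

def adaptation_verdict(errors_before, errors_after, tolerance=0.0):
    """Compare per-sequence errors before and after adaptation.

    Returns the mean change (negative means lower error after adaptation)
    and a coarse verdict string.
    """
    if len(errors_before) != len(errors_after) or not errors_before:
        raise ValueError("need two non-empty lists of equal length")
    deltas = [after - before for before, after in zip(errors_before, errors_after)]
    mean_delta = mean(deltas)
    if mean_delta < -tolerance:
        verdict = "improved"
    elif mean_delta > tolerance:
        verdict = "degraded"
    else:
        verdict = "no change"
    return mean_delta, verdict

# Placeholder errors, invented purely to exercise the function.
before = [0.30, 0.25, 0.40]
after = [0.28, 0.26, 0.35]
mean_delta, verdict = adaptation_verdict(before, after)
```

A "no change" or "degraded" verdict on fresh sequences would be the outcome that undermines the self-supervision signal.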

Figures

Figures reproduced from arXiv: 2604.14048 by Xingyi Yang, Yuhang Dai.

Figure 1. Free Geometry enables feed-forward 3D reconstruction models to self-evolve at test time without any 3D ground truth and generalize on models and datasets.
Figure 2. Long Sequence Provides Better Reconstruction Geometry.
Figure 3. Architecture of Free Geometry. The test sequence is processed in two configurations. Top: the full observation (all views, e.g. 8 views) passes through the Image Patch Embedding (e.g. DINOv2 [9]), the Multi-view Transformer, a randomized camera token, and encodes the views into feature representations. All encoders are frozen (gray). Bottom: the partial observation (half of views masked, e.g. 4 views) pass…
Figure 4. Self-Supervised Geometric Losses of Free Geometry.
Figure 5. Qualitative Results on Multi-view Depth.
Figure 6. Qualitative Results on 3D Reconstruction.
Figure 1 (supplementary material). Qualitative Results on 3D Reconstruction.
Figure 2 (supplementary material). Qualitative Results on Multi-view Depth.
Original abstract

Feed-forward 3D reconstruction models are efficient but rigid: once trained, they perform inference in a zero-shot manner and cannot adapt to the test scene. As a result, visually plausible reconstructions often contain errors, particularly under occlusions, specularities, and ambiguous cues. To address this, we introduce Free Geometry, a framework that enables feed-forward 3D reconstruction models to self-evolve at test time without any 3D ground truth. Our key insight is that, when the model receives more views, it produces more reliable and view-consistent reconstructions. Leveraging this property, given a testing sequence, we mask a subset of frames to construct a self-supervised task. Free Geometry enforces cross-view feature consistency between representations from full and partial observations, while maintaining the pairwise relations implied by the held-out frames. This self-supervision allows for fast recalibration via lightweight LoRA updates, taking less than 2 minutes per dataset on a single GPU. Our approach consistently improves state-of-the-art foundation models, including Depth Anything 3 and VGGT, across 4 benchmark datasets, yielding an average improvement of 3.73% in camera pose accuracy and 2.88% in point map prediction. Code is available at https://github.com/hiteacherIamhumble/Free-Geometry.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Free Geometry, a test-time adaptation framework for feed-forward 3D reconstruction models. Given a test sequence, it masks a subset of frames to create a self-supervised task that enforces cross-view feature consistency between full-sequence and partial reconstructions while preserving pairwise relations from the held-out frames. Lightweight LoRA updates are then applied to refine models such as Depth Anything 3 and VGGT. The paper reports average improvements of 3.73% in camera pose accuracy and 2.88% in point map prediction across four benchmark datasets, with the process taking less than 2 minutes per dataset on a single GPU and no 3D ground truth required.

Significance. If the central claims hold, the work would provide a practical, efficient mechanism for adapting rigid zero-shot 3D foundation models to individual test scenes via internal consistency signals. This addresses a key limitation of current feed-forward approaches in handling ambiguities like occlusions and specularities. The reported gains on standard benchmarks and the emphasis on reproducibility (code release) would make it a useful contribution to test-time adaptation in 3D vision, provided the self-supervision mechanism is shown to be robust rather than merely self-reinforcing.

major comments (2)
  1. [Method] Method section (description of the self-supervised consistency loss): The framework rests on the unvalidated premise that reconstructions from the full sequence are reliably more accurate and view-consistent than those from masked subsets, allowing the former to serve as pseudo-targets. No analysis, failure-case experiments, or quantitative comparison to ground truth is provided to show when this holds (e.g., under persistent ambiguities such as textureless regions or specularities). If the premise fails, the loss simply aligns the model to its own errors, directly undermining the claimed improvements.
  2. [Experiments] Experiments section (quantitative results and ablations): The reported average gains of 3.73% pose and 2.88% point-map accuracy are presented without ablation studies isolating the contributions of cross-view feature consistency versus pairwise relation preservation, without details on masking ratios or LoRA hyperparameters, and without analysis of variance across sequences. This makes it impossible to determine whether the gains are robust or sensitive to the specific self-supervision construction.
minor comments (2)
  1. [Abstract and Method] The abstract and method descriptions would benefit from explicit notation for the masking operation and the exact form of the consistency loss (e.g., whether it is L2 on features or a different metric).
  2. [Figures] Figure captions and the framework diagram should more clearly distinguish the full-sequence path from the masked-subset path to aid reader comprehension.
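To make the first minor comment concrete, one possible notation for the masking operation and the feature-consistency term is written out below. Every symbol here is an assumption introduced for illustration: the page does not give the paper's own notation, and the squared L2 form is exactly the detail the referee asks the authors to confirm. The stop-gradient on the full branch reflects the Figure 3 caption, which describes that path as frozen.

```latex
% Hypothetical notation, not taken from the paper.
% N input views; K = indices kept visible; H = indices held out.
\begin{align}
  \mathcal{V} &= \{I_1, \dots, I_N\}, \qquad
  \mathcal{K} \subset \{1, \dots, N\}, \qquad
  \mathcal{H} = \{1, \dots, N\} \setminus \mathcal{K} \\
  F^{\mathrm{full}} &= f_{\theta}(\mathcal{V}), \qquad
  F^{\mathrm{part}} = f_{\theta + \Delta\theta}\bigl(\{I_i\}_{i \in \mathcal{K}}\bigr) \\
  \mathcal{L}_{\mathrm{feat}} &= \frac{1}{|\mathcal{K}|} \sum_{i \in \mathcal{K}}
  \bigl\lVert \operatorname{sg}\bigl(F^{\mathrm{full}}_{i}\bigr) - F^{\mathrm{part}}_{i} \bigr\rVert_2^2
\end{align}
```

Here $\Delta\theta$ stands for the LoRA update and $\operatorname{sg}(\cdot)$ for a stop-gradient, so only the partial branch is trained toward the full-sequence target.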

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major point below and outline the revisions we will make to strengthen the paper.

Point-by-point responses
  1. Referee: [Method] Method section (description of the self-supervised consistency loss): The framework rests on the unvalidated premise that reconstructions from the full sequence are reliably more accurate and view-consistent than those from masked subsets, allowing the former to serve as pseudo-targets. No analysis, failure-case experiments, or quantitative comparison to ground truth is provided to show when this holds (e.g., under persistent ambiguities such as textureless regions or specularities). If the premise fails, the loss simply aligns the model to its own errors, directly undermining the claimed improvements.

    Authors: We acknowledge that the original manuscript does not provide direct quantitative comparisons to ground truth or failure-case analyses specifically validating that full-sequence reconstructions are superior to masked ones. The reported improvements on standard benchmarks provide indirect evidence of the method's effectiveness. To address this concern rigorously, we will add in the revised manuscript: (1) quantitative comparisons of full vs. masked reconstruction accuracy against ground truth on a subset of sequences, (2) failure case studies highlighting scenarios with textureless regions and specularities, and (3) discussion of conditions under which the premise holds. This will clarify the robustness of the self-supervision signal. revision: yes

  2. Referee: [Experiments] Experiments section (quantitative results and ablations): The reported average gains of 3.73% pose and 2.88% point-map accuracy are presented without ablation studies isolating the contributions of cross-view feature consistency versus pairwise relation preservation, without details on masking ratios or LoRA hyperparameters, and without analysis of variance across sequences. This makes it impossible to determine whether the gains are robust or sensitive to the specific self-supervision construction.

    Authors: We agree that additional details and ablations are necessary to demonstrate the robustness of the results. The original submission focused on overall performance but omitted component-wise ablations, specific hyperparameter values, and per-sequence variance. In the revision, we will include: ablations separating the effects of cross-view consistency and pairwise preservation, tables detailing masking ratios (e.g., 20-50%) and LoRA configurations (rank, alpha), and standard deviation or per-dataset variance analysis for the reported metrics. These additions will allow readers to assess sensitivity and reproducibility. revision: yes
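The ablations promised in the second response can be laid out as a simple configuration grid. The sketch below only enumerates candidate runs; the specific masking ratios, LoRA ranks, and alpha values are hypothetical picks within the ranges mentioned in the response, and the dictionary keys are invented names rather than options of the released code.

```python
from itertools import product

# Which of the two self-supervised terms is switched on in each variant.
LOSS_VARIANTS = {
    "feature_only": {"use_feature_loss": True, "use_relation_loss": False},
    "relation_only": {"use_feature_loss": False, "use_relation_loss": True},
    "both": {"use_feature_loss": True, "use_relation_loss": True},
}
MASK_RATIOS = (0.2, 0.35, 0.5)      # hypothetical points in the 20-50% range
LORA_SETTINGS = ((4, 8), (8, 16))   # hypothetical (rank, alpha) pairs

def ablation_grid():
    """Enumerate every combination of loss variant, masking ratio, and LoRA setting."""
    grid = []
    for (name, flags), ratio, (rank, alpha) in product(
            LOSS_VARIANTS.items(), MASK_RATIOS, LORA_SETTINGS):
        grid.append({"variant": name, **flags,
                     "mask_ratio": ratio, "lora_rank": rank, "lora_alpha": alpha})
    return grid

grid = ablation_grid()
```

Running each entry over several sequences and reporting the mean and standard deviation per configuration would cover both the component ablation and the variance analysis the referee requests.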

Circularity Check

0 steps flagged

No significant circularity; self-supervision uses held-out frames with external benchmark validation

full rationale

The paper's core mechanism generates a self-supervised consistency loss by comparing the model's output on a full test sequence against its output on a masked subset of the same sequence, then applies LoRA updates. This does not reduce to a tautology by construction because the full-sequence output is not mathematically forced to equal the masked output; the loss is minimized through parameter updates whose effect is measured on independent ground-truth benchmarks (camera pose and point map accuracy). No equations are presented that equate the target to the input by definition, no parameters are fitted on a subset and then renamed as a prediction of the same quantity, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The derivation therefore remains self-contained against external evaluation rather than self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that additional views improve reconstruction consistency, which is turned into a self-supervised objective; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: When the model receives more views, it produces more reliable and view-consistent reconstructions.
    This property is directly invoked to justify masking frames and using the resulting consistency as supervision.

pith-pipeline@v0.9.0 · 5530 in / 1265 out tokens · 58752 ms · 2026-05-10T12:53:54.582395+00:00 · methodology


Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 1 internal anchor

  [1] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: ICCV (2021)

  [2] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2021)

  [3] Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P.H., Buchatskaya, E., Doersch, C., Pires, B.A., Guo, Z.D., Azar, M.G., Piot, B., Kavukcuoglu, K., Munos, R., Valko, M.: Bootstrap your own latent: A new approach to self-supervised learning. In: NeurIPS (2020)

  [4] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

  [5] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022)

  [6] Lin, H., Chen, S., Liew, J.H., Chen, D.Y., Li, Z., Zhao, Y., Peng, S., Guo, H., Zhou, X., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. In: The Fourteenth International Conference on Learning Representations (2026)

  [7] Liu, Y., Kothari, P., van Delft, B.G., Bellot-Gurlet, B., Mordan, T., Alahi, A.: TTT++: When does self-supervised test-time training fail or thrive? In: Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems (2021)

  [8] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)

  [9] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. TMLR (2024)

  [10] Park, W., Kim, D., Lu, Y., Cho, M.: Relational knowledge distillation. In: CVPR (2019)

  [11] Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: FitNets: Hints for thin deep nets. In: ICLR (2015)

  [12] Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Conference on Computer Vision and Pattern Recognition (2016)

  [13] Schöps, T., Schönberger, J.L., Galliani, S., Sattler, T., Schindler, K., Pollefeys, M., Geiger, A.: A multi-view stereo benchmark with high-resolution images and multi-camera videos. In: CVPR (2017)

  [14] Shotton, J., Glocker, B., Zach, C., Izadi, S., Criminisi, A., Fitzgibbon, A.: Scene coordinate regression forests for camera relocalization in RGB-D images. In: CVPR (2013)

  [15] Sun, Y., Wang, X., Zhuang, L., Miller, J., Hardt, M., Efros, A.A.: Test-time training with self-supervision for generalization under distribution shifts. In: ICML (2020)

  [16] Umeyama, S.: Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(4), 376–380 (1991)

  [17] Wang, D., Shelhamer, E., Liu, S., Olshausen, B., Darrell, T.: Tent: Fully test-time adaptation by entropy minimization. In: International Conference on Learning Representations (2021)

  [18] Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

  [19] Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: Dust3r: Geometric 3d vision made easy. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

  [20] Yao, Y., Luo, Z., Li, S., Fang, T., Quan, L.: Mvsnet: Depth inference for unstructured multi-view stereo. European Conference on Computer Vision (2018)

  [21] Yeshwanth, C., Liu, Y.C., Nießner, M., Dai, A.: ScanNet++: A high-fidelity dataset of 3d indoor scenes. In: ICCV (2023)

  [22] Yuan, Y., Shen, Q., Wang, S., Yang, X., Wang, X.: Test3r: Learning to reconstruct 3d at test time. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

  [23] Zhang, M., Levine, S., Finn, C.: MEMO: Test time robustness via adaptation and augmentation. In: NeurIPS (2022)

Supplementary Material. 1 Method Details. 1.1 Free Geometry Self-Supervised Geometric Losses. Free Geometry performs test-time adaptation through a self-supervised geometric objective defined between two branches of the same scene...