pith. machine review for the scientific record.

arxiv: 2507.13347 · v3 · submitted 2025-07-17 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

π³: Permutation-Equivariant Visual Geometry Learning


Pith reviewed 2026-05-12 17:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords permutation equivariance, visual geometry reconstruction, camera pose estimation, depth estimation, point maps, multi-view geometry, feedforward networks

The pith

A fully permutation-equivariant network reconstructs visual geometry without fixing a reference view.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a neural network that processes sets of images for geometry reconstruction while treating every possible ordering of the inputs identically. Previous approaches typically select one view as the anchor for all other calculations, which can lead to poor results if that view is occluded or low-quality. By building permutation equivariance directly into the model, the predictions for camera positions and local points remain consistent regardless of input sequence. This leads to more reliable outputs on tasks like estimating camera motion from video frames or building 3D maps from multiple photos.

Core claim

We introduce π³, a feed-forward neural network for visual geometry reconstruction that employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design makes the model inherently robust to input ordering and achieves higher accuracy, enabling state-of-the-art performance on camera pose estimation, monocular and video depth estimation, and dense point map reconstruction.

What carries the argument

The fully permutation-equivariant architecture that enforces symmetry across input views to produce order-independent predictions of affine-invariant camera poses and scale-invariant local point maps.
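
The contrast between reference-anchored and permutation-equivariant processing can be illustrated with a toy example. The sketch below is not the π³ architecture: `equivariant_layer`, `anchored_layer`, their weights, and the feature values are hypothetical stand-ins, assuming each view has already been reduced to a small feature vector. It shows that a layer built from a per-view term plus an order-free pooled term commutes with reordering, while a layer that measures everything against the first view does not.

```python
def equivariant_layer(views, w_self=0.7, w_pool=0.3):
    """Toy DeepSets-style layer over a list of per-view feature vectors.

    Each output mixes a view's own features with the mean over all views.
    The mean ignores ordering and no view is singled out as a reference,
    so permuting the inputs permutes the outputs in the same way.
    """
    n, dim = len(views), len(views[0])
    pooled = [sum(v[d] for v in views) / n for d in range(dim)]
    return [[w_self * v[d] + w_pool * pooled[d] for d in range(dim)] for v in views]


def anchored_layer(views):
    """Toy reference-anchored layer: every view is expressed relative to view 0."""
    anchor = views[0]
    return [[v[d] - anchor[d] for d in range(len(anchor))] for v in views]


def max_gap(a, b):
    """Largest absolute element-wise difference between two lists of vectors."""
    return max(abs(x - y) for va, vb in zip(a, b) for x, y in zip(va, vb))


# Five made-up views with four features each.
views = [[float(i * 4 + d) for d in range(4)] for i in range(5)]
perm = [3, 0, 4, 1, 2]  # an arbitrary reordering of the five views
permuted = [views[i] for i in perm]

# Compare f(perm(X)) with perm(f(X)) for both layers.
out = equivariant_layer(views)
gap_equivariant = max_gap(equivariant_layer(permuted), [out[i] for i in perm])

ref = anchored_layer(views)
gap_anchored = max_gap(anchored_layer(permuted), [ref[i] for i in perm])
```

Here `gap_equivariant` is zero up to floating-point rounding, whereas `gap_anchored` is the largest feature difference between the two views that served as the anchor in each ordering.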

If this is right

  • The model can process image sets in any order without changing the underlying geometry prediction.
  • It reaches state-of-the-art results on multiple visual geometry benchmarks without additional alignment procedures.
  • The same network supports both single-image depth estimation and multi-view tasks through its symmetric design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The equivariant design might naturally extend to handling dynamic scenes or objects by maintaining consistency across frames.
  • One could explore whether this removes the need for explicit regularization terms in the loss function for multi-view consistency.
  • Applying the method to extremely large unordered image collections could test its scalability beyond controlled datasets.

Load-bearing premise

That a permutation-equivariant feed-forward network will automatically produce consistent global geometry across different input orderings when trained only on typical visual datasets.

What would settle it

Run the trained model on a fixed set of images presented in two different orders, undo the permutation on the outputs, and check whether the camera poses and point maps align exactly. Any residual difference after alignment would indicate a failure of equivariance or of global consistency.
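
The check described above can be sketched as a small harness. This is a minimal illustration under stated assumptions, not the paper's evaluation code: `order_consistency`, `symmetric_model`, and `order_dependent_model` are hypothetical names, the "views" are made-up lists of 3D points, and only the scale ambiguity of per-view point maps is handled. Aligning camera poses up to a similarity transform would need an additional rigid alignment step that is not shown.

```python
def optimal_scale(pred, ref):
    """Least-squares scalar s minimizing sum ||s * pred - ref||^2 over 3D points."""
    num = sum(p * r for pp, rp in zip(pred, ref) for p, r in zip(pp, rp))
    den = sum(p * p for pp in pred for p in pp)
    return num / den


def scale_aligned_residual(pred, ref):
    """Largest coordinate error left after the best single rescaling of `pred`."""
    s = optimal_scale(pred, ref)
    return max(abs(s * p - r) for pp, rp in zip(pred, ref) for p, r in zip(pp, rp))


def order_consistency(model, views, perm):
    """Compare per-view outputs of `model` for the original and a permuted order."""
    out_a = model(views)
    out_b = model([views[i] for i in perm])
    restored = [None] * len(views)
    for position, view_index in enumerate(perm):
        restored[view_index] = out_b[position]  # undo the permutation
    return max(scale_aligned_residual(b, a) for a, b in zip(out_a, restored))


def symmetric_model(views):
    """Hypothetical stand-in: rescales all points by the mean depth over every view."""
    depths = [p[2] for v in views for p in v]
    mean_depth = sum(depths) / len(depths)
    return [[tuple(c / mean_depth for c in p) for p in v] for v in views]


def order_dependent_model(views):
    """Hypothetical stand-in: shifts each view's x by an amount tied to its position."""
    return [[(p[0] + 0.1 * pos, p[1], p[2]) for p in v] for pos, v in enumerate(views)]


# Five toy "views", each a list of three 3D points (made-up data).
views = [[(float(i + k), 1.0 + k, 2.0) for k in range(3)] for i in range(5)]
perm = [3, 0, 4, 1, 2]

residual_symmetric = order_consistency(symmetric_model, views, perm)
residual_ordered = order_consistency(order_dependent_model, views, perm)
```

For the symmetric stand-in the residual vanishes up to rounding, while the position-dependent stand-in leaves a residual that rescaling cannot absorb. A real test would replace the stand-ins with the trained network and repeat the comparison over several random permutations.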

Original abstract

We introduce $\pi^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In contrast, $\pi^3$ employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design not only makes our model inherently robust to input ordering, but also leads to higher accuracy and performance. These advantages enable our simple and bias-free approach to achieve state-of-the-art performance on a wide range of tasks, including camera pose estimation, monocular/video depth estimation, and dense point map reconstruction. Code and models are available at https://github.com/yyfz/Pi3.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces π³, a feed-forward neural network employing a fully permutation-equivariant architecture to perform visual geometry reconstruction. It predicts affine-invariant camera poses and scale-invariant local point maps directly from unordered input views, eliminating the conventional fixed reference view. The design is claimed to confer inherent robustness to input ordering, improved accuracy, and state-of-the-art results across camera pose estimation, monocular/video depth estimation, and dense point map reconstruction, with code and models released publicly.

Significance. If the central claims are substantiated, the work would represent a meaningful advance in multi-view geometry by removing a common inductive bias (reference-frame anchoring) that can cause instability. A verified permutation-equivariant feed-forward model capable of producing consistent global geometry from arbitrary orderings could simplify pipelines for 3D reconstruction and enable more flexible processing of image sets. The public code release is a clear strength for reproducibility.

major comments (2)
  1. [Abstract] The abstract's core assertion that the permutation-equivariant design produces 'inherently robust' reconstructions 'without any reference frames' or 'post-processing alignment step' is load-bearing for the SOTA claims. Equivariance guarantees only that f(perm(X)) = perm(f(X)); it does not by itself ensure that the induced global 3D structure (poses and point maps) remains identical across permutations. The manuscript must demonstrate, via explicit experiments, that the training distribution and loss alone enforce this global consistency rather than merely local equivariance.
  2. [Experiments] The reported performance gains on pose estimation and dense reconstruction rest on the premise that no evaluation-time alignment is required. Without ablations that measure output consistency (e.g., relative pose error or point-map alignment error) when the same set of views is presented in different orders, it remains unclear whether the SOTA numbers reflect true order-invariance or implicit alignment during evaluation.
minor comments (1)
  1. The abstract would benefit from a brief statement of the precise invariance properties (affine for poses, scale for points) and how they are enforced in the loss.
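
The property cited in major comment 1 can be stated per view. The notation below ($x_i$ for the $i$-th input image, $y_i$ for the outputs predicted for that image, $\sigma$ for a permutation, $N$ for the number of views) is introduced here for illustration and is not taken from the paper.

```latex
f(x_1, \dots, x_N) = (y_1, \dots, y_N)
\;\Longrightarrow\;
f\bigl(x_{\sigma(1)}, \dots, x_{\sigma(N)}\bigr)
  = \bigl(y_{\sigma(1)}, \dots, y_{\sigma(N)}\bigr)
\quad \text{for every permutation } \sigma \text{ of } \{1, \dots, N\}.
```

Whether the trained network satisfies this to within numerical precision, and whether the resulting poses and point maps agree as one global reconstruction, is the empirical question the referee asks the authors to test.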

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and precise comments on the role of permutation equivariance and the importance of verifying global consistency. We address each point below and will revise the manuscript accordingly.

  1. Referee: [Abstract] The abstract's core assertion that the permutation-equivariant design produces 'inherently robust' reconstructions 'without any reference frames' or 'post-processing alignment step' is load-bearing for the SOTA claims. Equivariance guarantees only that f(perm(X)) = perm(f(X)); it does not by itself ensure that the induced global 3D structure (poses and point maps) remains identical across permutations. The manuscript must demonstrate, via explicit experiments, that the training distribution and loss alone enforce this global consistency rather than merely local equivariance.

    Authors: We agree that architectural equivariance by itself does not automatically guarantee identical global 3D structure across input permutations; additional inductive biases from the training objective are required. In π³ the combination of the fully permutation-equivariant backbone with losses defined on affine-invariant relative poses and scale-invariant local point maps encourages the network to recover a single consistent scene geometry (up to the declared invariances) regardless of ordering. The absence of a designated reference view during both training and inference further removes the usual source of ordering-dependent drift. To make this explicit, the revised manuscript will include new experiments that quantify output consistency—specifically, relative pose error and point-map alignment error—when the identical set of views is fed in multiple random permutations. revision: yes

  2. Referee: [Experiments] The reported performance gains on pose estimation and dense reconstruction rest on the premise that no evaluation-time alignment is required. Without ablations that measure output consistency (e.g., relative pose error or point-map alignment error) when the same set of views is presented in different orders, it remains unclear whether the SOTA numbers reflect true order-invariance or implicit alignment during evaluation.

    Authors: All reported numbers were obtained by feeding unordered image sets directly into the model and using the raw outputs without any post-hoc alignment, reference-frame selection, or Procrustes step. Nevertheless, we acknowledge that explicit verification of ordering invariance is necessary to fully substantiate the claims. The revised version will therefore add dedicated ablations that (i) permute the input views in multiple ways, (ii) compute pairwise relative pose errors and point-map alignment errors across those permutations, and (iii) compare against baselines that rely on a fixed reference view. These results will be reported alongside the existing benchmark tables. revision: yes
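
One way to compute a pairwise relative pose error without any alignment step is sketched below for the rotation part only. It is an illustrative assumption, not the authors' protocol: the helper names are hypothetical, poses are taken to be camera-to-world rotation matrices, the test rotations are made up, and translation is ignored.

```python
import math


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def transpose(a):
    """Transpose of a 3x3 matrix stored as nested lists."""
    return [list(row) for row in zip(*a)]


def rot_z(deg):
    """Rotation about the z axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rot_x(deg):
    """Rotation about the x axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def angle_deg(r_a, r_b):
    """Geodesic angle in degrees between two rotation matrices."""
    r = matmul(transpose(r_a), r_b)
    cos_theta = (r[0][0] + r[1][1] + r[2][2] - 1.0) / 2.0
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def relative_rotation_error(run_a, run_b):
    """Mean angle between matching relative rotations R_i^T R_j of two runs."""
    errors = []
    for i in range(len(run_a)):
        for j in range(i + 1, len(run_a)):
            rel_a = matmul(transpose(run_a[i]), run_a[j])
            rel_b = matmul(transpose(run_b[i]), run_b[j])
            errors.append(angle_deg(rel_a, rel_b))
    return sum(errors) / len(errors)


# Run A: four made-up camera rotations.
run_a = [matmul(rot_z(20.0 * i), rot_x(10.0 * i)) for i in range(4)]
# Run B: the same cameras expressed in a different world frame.
global_frame = matmul(rot_x(35.0), rot_z(-50.0))
run_b = [matmul(global_frame, r) for r in run_a]
# Run C: identical to run A except that view 2 is off by 5 degrees.
run_c = [[list(row) for row in r] for r in run_a]
run_c[2] = matmul(run_a[2], rot_z(5.0))

error_frame_change = relative_rotation_error(run_a, run_b)
error_perturbed = relative_rotation_error(run_a, run_c)
```

Relative rotations cancel any shared world frame, since $(G R_i)^\top (G R_j) = R_i^\top R_j$, so `error_frame_change` is zero up to rounding. In run C, three of the six view pairs involve the perturbed view, which gives a mean error of about 2.5 degrees.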

Circularity Check

0 steps flagged

No circularity: novel architecture and training yield independent claims


The paper's derivation chain consists of defining a new permutation-equivariant feed-forward network architecture, training it end-to-end on visual geometry tasks, and reporting empirical performance. No equations reduce predicted poses or point maps to quantities fitted from the authors' own prior outputs; the equivariance property is enforced by architectural construction rather than by re-labeling fitted parameters as predictions. No self-citation chain is invoked to establish uniqueness or forbid alternatives, and the central claims about reference-free consistency rest on the training distribution and loss rather than tautological re-derivation. The reported SOTA results are therefore external to the architectural definition itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The model relies on standard neural-network training assumptions and the mathematical property of permutation equivariance; no additional free parameters, ad-hoc axioms, or invented physical entities are introduced beyond the network itself.

axioms (1)
  • standard math Permutation equivariance of the network implies output invariance to input reordering
    Invoked in the abstract description of the architecture.

pith-pipeline@v0.9.0 · 5473 in / 1239 out tokens · 37814 ms · 2026-05-12T17:08:14.428406+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Forward citations

Cited by 34 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images

    cs.CV 2026-05 unverdicted novelty 7.0

    Cross3R performs feed-forward 3D reconstruction and 6-DoF pose estimation from any combination of satellite, UAV, and ground images, outperforming baselines on a new 278K-image tri-view dataset.

  2. Holo360D: A Large-Scale Real-World Dataset with Continuous Trajectories for Advancing Panoramic 3D Reconstruction and Beyond

    cs.CV 2026-04 unverdicted novelty 7.0

    Holo360D is the first large-scale dataset providing continuous panoramic sequences with accurately aligned high-completeness depth maps and meshes for training panoramic 3D reconstruction models.

  3. Keep It CALM: Toward Calibration-Free Kilometer-Level SLAM with Visual Geometry Foundation Models via an Assistant Eye

    cs.RO 2026-04 unverdicted novelty 7.0

    CAL2M achieves calibration-free kilometer-level SLAM by using an assistant eye for scale, epipolar-guided intrinsic correction, and anchor propagation for nonlinear sub-map alignment.

  4. Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

  5. AnyImageNav: Any-View Geometry for Precise Last-Meter Image-Goal Navigation

    cs.RO 2026-04 unverdicted novelty 7.0

    AnyImageNav uses a semantic-to-geometric cascade with 3D multi-view foundation models to recover precise 6-DoF poses from goal images, achieving 0.27m position error and state-of-the-art success rates on Gibson and HM...

  6. 3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image

    cs.CV 2026-04 unverdicted novelty 7.0

    3D-Fixer performs in-place 3D asset completion from single-view partial point clouds via coarse-to-fine generation with ORFA conditioning, plus a new ARSG-110K dataset, to achieve higher geometric accuracy than MIDI a...

  7. VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation

    cs.CV 2026-03 unverdicted novelty 7.0

    VGGT-360 delivers geometry-consistent zero-shot panoramic depth by converting panoramas into multi-view 3D reconstructions via VGGT models and three plug-and-play correction modules, then reprojecting the result.

  8. MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    cs.CV 2025-09 unverdicted novelty 7.0

    MapAnything is a unified feed-forward transformer that regresses metric 3D scene geometry and cameras from images using a factored representation of depth maps, ray maps, poses, and scale.

  9. Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

    cs.CV 2026-05 unverdicted novelty 6.0

    Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.

  10. Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    Asymmetric token reduction, with distinct merging for queries and pruning for key-values plus layer-wise adaptation, delivers up to 28x speedup on 1000-frame 3D reconstruction inputs while preserving competitive quality.

  11. 3D-ReGen: A Unified 3D Geometry Regeneration Framework

    cs.CV 2026-04 unverdicted novelty 6.0

    3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.

  12. AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model

    cs.CV 2026-04 unverdicted novelty 6.0

    AnyRecon enables scalable 3D reconstruction from arbitrary sparse unordered views by combining video diffusion with explicit global geometric memory and retrieval to maintain consistency across large viewpoint changes.

  13. Geometric Context Transformer for Streaming 3D Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...

  14. Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

    cs.CV 2026-04 unverdicted novelty 6.0

    The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...

  15. Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors

    cs.CV 2026-04 unverdicted novelty 6.0

    The Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors outperforms prior methods on dynamic benchmarks by cutting Mean Accuracy error 13.43% and raising segmentation F-measure 10.49% via three uncerta...

  16. Self-Improving 4D Perception via Self-Distillation

    cs.CV 2026-04 unverdicted novelty 6.0

    SelfEvo enables pretrained 4D perception models to self-improve on unlabeled videos via self-distillation, delivering up to 36.5% relative gains in video depth estimation and 20.1% in camera estimation across eight be...

  17. PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    PhyEdit improves physical accuracy in image object manipulation by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.

  18. SpectralSplat: Appearance-Disentangled Feed-Forward Gaussian Splatting for Driving Scenes

    cs.CV 2026-04 unverdicted novelty 6.0

    SpectralSplat disentangles appearance from geometry in feed-forward 3D Gaussian Splatting by factoring color into base and adapted streams conditioned on DINOv2 embeddings, trained on paired data from a hybrid relight...

  19. UniRecGen: Unifying Multi-View 3D Reconstruction and Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    UniRecGen unifies reconstruction and generation via shared canonical space and disentangled cooperative learning to produce complete, consistent 3D models from sparse views.

  20. Neural Harmonic Textures for High-Quality Primitive Based Neural Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    Neural Harmonic Textures add periodic feature interpolation and deferred neural decoding to primitive representations, achieving state-of-the-art real-time novel-view synthesis and bridging primitive and neural-field methods.

  21. DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

    cs.CV 2026-04 unverdicted novelty 6.0

    DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...

  22. TerraSky3D: Multi-View Reconstructions of European Landmarks in 4K

    cs.CV 2026-03 unverdicted novelty 6.0

    TerraSky3D is a new high-resolution multi-view dataset with 50,000 images in 150 scenes of European landmarks, supplied with poses and depth maps to support 3D reconstruction research.

  23. Depth Anything 3: Recovering the Visual Space from Any Views

    cs.CV 2025-11 unverdicted novelty 6.0

    DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.

  24. SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

    cs.CV 2026-05 unverdicted novelty 5.0

    SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...

  25. WildPose: A Unified Framework for Robust Pose Estimation in the Wild

    cs.CV 2026-05 unverdicted novelty 5.0

    WildPose unifies feedforward 3D features from MASt3R with differentiable bundle adjustment for robust monocular pose estimation across dynamic, static, and low-ego-motion scenes.

  26. Real-Scale Island Area and Coastline Estimation using Only its Place Name or Coordinates

    cs.CV 2026-05 unverdicted novelty 5.0

    A monocular vision system estimates real-scale island area and coastline length with around 10% error using only place name or coordinates input via automated image capture, point cloud generation, and trajectory alignment.

  27. Syn4D: A Multiview Synthetic 4D Dataset

    cs.CV 2026-05 unverdicted novelty 5.0

    Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.

  28. StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

    cs.CV 2026-04 unverdicted novelty 5.0

    StreamCacheVGGT improves streaming 3D geometry reconstruction accuracy and stability under fixed memory by using cross-layer token importance scoring and hybrid cache compression instead of pure eviction.

  29. MR.ScaleMaster: Scale-Consistent Collaborative Mapping from Crowd-Sourced Monocular Videos

    cs.RO 2026-04 unverdicted novelty 5.0

    MR.ScaleMaster adds a false-loop alarm and per-session Sim(3) scale estimation to enable accurate multi-agent monocular mapping, showing 7.2x ATE improvement on KITTI with up to 15 agents.

  30. MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM

    cs.RO 2026-04 unverdicted novelty 5.0

    MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.

  31. Rapid Forest Fuel Load Estimation via Virtual Remote Sensing and Metric-Scale Feed-Forward 3D Reconstruction

    cs.CV 2026-05 unverdicted novelty 4.0

    A pipeline using virtual remote sensing from Google Earth Studio, Pi-Long 3D reconstruction, metric alignment, and watershed segmentation estimates forest fuel load as a scalable alternative to traditional surveys.

  32. OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL

    cs.RO 2026-04 unverdicted novelty 4.0

    OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.

  33. HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

    cs.CV 2026-04 unverdicted novelty 4.0

    HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claimin...

  34. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 34 Pith papers · 5 internal anchors

  1. [1]

    ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. arXiv preprint arXiv:2111.08897,

  2. [2]

    Project Aria: A New Tool for Egocentric Multi-Modal AI Research

    Jakob Engel, Kiran Somasundaram, Michael Goesele, Albert Sun, Alexander Gamino, Andrew Turner, Arjang Talattof, Arnie Yuan, Bilal Souti, Brighid Meredith, et al. Project aria: A new tool for egocentric multi-modal ai research. arXiv preprint arXiv:2308.13561,

  3. [3]

    Grounding image matching in 3d with mast3r

    Published as a conference paper at ICLR 2026.

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European Conference on Computer Vision, pp. 71–91. Springer,

  4. [4]

    Megadepth: Learning single-view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2041–2050,

  5. [5]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193,

  6. [6]

    Pixelwise view selection for unstructured multi-view stereo

    Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, pp. 501–518. Springer,

  7. [7]

    Indoor segmentation and support inference from rgbd images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In European conference on computer vision, pp. 746–760. Springer,

  8. [8]

    Mv-dust3r+: Single-stage scene reconstruction from sparse views in 2 seconds

    Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexander Schwing, and Zhicheng Yan. Mv-dust3r+: Single-stage scene reconstruction from sparse views in 2 seconds. arXiv preprint arXiv:2412.06974,

  9. [9]

    Aether: Geometric-aware unified world modeling. arXiv preprint arXiv:2503.18945,

    Aether Team, Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, and Tong He. Aether: Geometric-aware unified world modeling. arXiv preprint arXiv:2503.18945,

  10. [10]

    3d reconstruction with spatial memory. arXiv preprint arXiv:2408.16061, 2024

    Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. arXiv preprint arXiv:2408.16061,

  11. [11]

    Vggt: Visual geometry grounded transformer, 2025

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. arXiv preprint arXiv:2503.11651, 2025a. Kaixuan Wang and Shaojie Shen. Flow-motion and depth network for monocular stereo and beyond. IEEE Robotics and Automation Letters, 5(2):3307–3314,

  12. [12]

    MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 10510–10522, 2025b. Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate ...

  13. [13]

    Tartanair: A dataset to push the limits of visual slam

    Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4909–

  14. [14]

    Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. arXiv preprint arXiv:2501.13928,

    Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. arXiv preprint arXiv:2501.13928,

  15. [15]

    URL https://proceedings.neurips.cc/paper_files/paper/2024/file/26cfdcd8fe6fd75cc53e92963a656c58-Paper-Conference.pdf

    doi: 10.52202/079017-0688. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/26cfdcd8fe6fd75cc53e92963a656c58-Paper-Conference.pdf. Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo n...

  16. [16]

    Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825,

  17. [17]

    Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. arXiv preprint arXiv:2502.12138,

    Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. arXiv preprint arXiv:2502.12138,

  18. [18]

    Stereo Magnification: Learning View Synthesis using Multiplane Images

    Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817,

  19. [19]

    A.2 TRAINING DETAILS [Figure 6: Comparison of predicted pose distributions; panels: VGGT, π³ (ours); axes: x, y, z]

    and is then converted to a 3×3 rotation matrix via SVD orthogonalization. A.2 TRAINING DETAILS [Figure 6: Comparison of predicted pose distributions; panels: VGGT, π³ (ours); axes: x, y, z. We visualize the predicted pose distributions in 3D space. π³ shows a clear low-dimensional structure, while VGGT's distribution is scattered.] We train π³ in two stages, a process ...

  20. [20]

    cold start

    Regarding optimization, we set the initial learning rate for all model components to $5\times10^{-5}$. We employ a OneCycleLR scheduler, where the learning rate anneals from its maximum value down to a minimal value over the entire training duration following a cosine curve. We use the same learning rate and scheduler settings for both stages. The confidence head i...

  21. [21]

    Zero-shot pose estimation accuracy is evaluated on Sintel and TUM-dynamics for all methods

    samples during training time. Zero-shot pose estimation accuracy is evaluated on Sintel and TUM-dynamics for all methods. A.6 ABLATION DETAILS The primary difference between our full model and the ablated models (Model 1 and Model

  22. [22]

    Accordingly, in Tab

    and FLARE (Zhang et al., 2025). Accordingly, in Tab. 9, we present a full set of RRA, RTA, and AUC metrics across thresholds of 1°, 3°, 5°, 10°, and 15°, evaluated on RealEstate10K. Our π³ model demonstrates robust and consistent performance even with these more demanding constraints. Table 9: Camera pose estimation with tighter angular thresholds on RealE...

  23. [23]

    54.30 87.24 94.78 97.90 98.46 5.47 24.56 39.23 59.29 69.11 3.77 13.67 22.36 37.33 46.71
    CUT3R (Wang et al., 2025b) 78.63 96.06 98.15 99.31 99.63 16.23 51.43 67.44 82.98 88.93 13.40 33.39 45.63 61.78 70.15
    FLARE (Zhang et al.,

  24. [24]

    0.150 0.111 0.048 0.018 0.150 0.097 0.061 0.024 3.134 1.476 0.875 0.646
    CUT3R (Wang et al., 2025b) 0.097 0.049 0.025 0.009 0.091 0.036 0.065 0.025 4.021 1.886 0.684 0.551
    FLARE (Zhang et al.,

  25. [25]

    We evaluate it on our standard benchmarks with input resolution 518, following CUT3R (Wang et al., 2025b) protocol

    0.115 0.083 0.023 0.010 0.052 0.023 0.020 0.009 2.834 1.409 0.564 0.377
    VGGT (Wang et al., 2025a) 0.050 0.029 0.024 0.010 0.058 0.032 0.015 0.007 1.619 0.888 0.287 0.177
    π³ (Ours) 0.061 0.039 0.019 0.009 0.026 0.013 0.013 0.006 1.472 0.626 0.199 0.128
    Monocular depth estimation compared with Depth Anything V2 (Yang et al., 2024), one of the SOTA models for monocul...