Recognition: 2 theorem links · Lean Theorem
π³: Permutation-Equivariant Visual Geometry Learning
Pith reviewed 2026-05-12 17:08 UTC · model grok-4.3
The pith
A fully permutation-equivariant network reconstructs visual geometry without fixing a reference view.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce π³, a feed-forward neural network for visual geometry reconstruction that employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design makes the model inherently robust to input ordering and achieves higher accuracy, enabling state-of-the-art performance on camera pose estimation, monocular and video depth estimation, and dense point map reconstruction.
What carries the argument
The fully permutation-equivariant architecture that enforces symmetry across input views to produce order-independent predictions of affine-invariant camera poses and scale-invariant local point maps.
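To make "scale-invariant" concrete, the following is a minimal Python sketch of comparing a predicted point map with a target after solving for a single global scale. It assumes a plain least-squares scale; the names `optimal_scale` and `scale_invariant_error` and the toy points are hypothetical and not taken from the paper, whose loss may use a different alignment.

```python
def optimal_scale(pred, target):
    """Least-squares scale s minimizing sum ||s * p - q||^2 over paired 3D points."""
    num = sum(p[k] * q[k] for p, q in zip(pred, target) for k in range(3))
    den = sum(p[k] * p[k] for p in pred for k in range(3))
    return num / den


def scale_invariant_error(pred, target):
    """Mean absolute coordinate error after applying the optimal global scale."""
    s = optimal_scale(pred, target)
    total = sum(abs(s * p[k] - q[k]) for p, q in zip(pred, target) for k in range(3))
    return total / (3 * len(pred))


# Toy data (assumption): the prediction equals the target shrunk by a factor of 4.
target = [[1.0, 2.0, 3.0], [0.0, -1.0, 2.0], [4.0, 0.0, 1.0]]
pred = [[0.25 * c for c in point] for point in target]

s = optimal_scale(pred, target)
err = scale_invariant_error(pred, target)
print(f"recovered scale: {s}, error after alignment: {err}")
```

A prediction that differs from the target only by a global scale incurs no error under this measure, which is the sense in which such a comparison ignores absolute scale.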
If this is right
- The model can process image sets in any order without changing the underlying geometry prediction.
- It reaches state-of-the-art results on multiple visual geometry benchmarks without additional alignment procedures.
- The same network supports both single-image depth estimation and multi-view tasks through its symmetric design.
Where Pith is reading between the lines
- The equivariant design might naturally extend to handling dynamic scenes or objects by maintaining consistency across frames.
- One could explore whether this removes the need for explicit regularization terms in the loss function for multi-view consistency.
- Applying the method to extremely large unordered image collections could test its scalability beyond controlled datasets.
Load-bearing premise
That a permutation-equivariant feed-forward network will automatically produce consistent global geometry across different input orderings when trained only on typical visual datasets.
What would settle it
Run the trained model on a fixed set of images presented in two different orders and verify if the output camera poses and point maps can be aligned perfectly after permuting back; any residual differences after alignment would indicate a failure of the equivariance or consistency.
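The check described above can be sketched in a few lines of Python. The two toy functions below are hypothetical stand-ins, not the released model: one is permutation-equivariant by construction, the other anchors everything to the first view. A real test would replace them with calls to the trained network, whose interface is not shown here.

```python
def equivariant_model(views):
    """Toy stand-in: subtract the mean over all views from each view."""
    n, dim = len(views), len(views[0])
    mean = [sum(v[d] for v in views) / n for d in range(dim)]
    return [[v[d] - mean[d] for d in range(dim)] for v in views]


def reference_anchored_model(views):
    """Toy stand-in for a fixed-reference design: express each view relative to the first."""
    ref = views[0]
    return [[v[d] - ref[d] for d in range(len(ref))] for v in views]


def equivariance_gap(model, views, perm):
    """Largest absolute difference between model(perm(views)) and perm(model(views))."""
    out_of_permuted = model([views[i] for i in perm])
    baseline = model(views)
    permuted_out = [baseline[i] for i in perm]
    return max(
        abs(a - b)
        for row_a, row_b in zip(out_of_permuted, permuted_out)
        for a, b in zip(row_a, row_b)
    )


# Toy inputs (assumption): four "views", each a 2-vector, and one fixed reordering.
views = [[0.0, 1.0], [2.0, 0.5], [1.5, -1.0], [-0.5, 3.0]]
perm = [2, 0, 3, 1]

gap_equivariant = equivariance_gap(equivariant_model, views, perm)
gap_anchored = equivariance_gap(reference_anchored_model, views, perm)
print(f"equivariant gap: {gap_equivariant}, reference-anchored gap: {gap_anchored}")
```

The equivariant stand-in shows no gap beyond floating-point error, while the reference-anchored one changes its outputs whenever the first view changes. For a trained network, any residual gap would be the quantity of interest.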
Original abstract
We introduce $\pi^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In contrast, $\pi^3$ employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design not only makes our model inherently robust to input ordering, but also leads to higher accuracy and performance. These advantages enable our simple and bias-free approach to achieve state-of-the-art performance on a wide range of tasks, including camera pose estimation, monocular/video depth estimation, and dense point map reconstruction. Code and models are available at https://github.com/yyfz/Pi3.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces π³, a feed-forward neural network employing a fully permutation-equivariant architecture to perform visual geometry reconstruction. It predicts affine-invariant camera poses and scale-invariant local point maps directly from unordered input views, eliminating the conventional fixed reference view. The design is claimed to confer inherent robustness to input ordering, improved accuracy, and state-of-the-art results across camera pose estimation, monocular/video depth estimation, and dense point map reconstruction, with code and models released publicly.
Significance. If the central claims are substantiated, the work would represent a meaningful advance in multi-view geometry by removing a common inductive bias (reference-frame anchoring) that can cause instability. A verified permutation-equivariant feed-forward model capable of producing consistent global geometry from arbitrary orderings could simplify pipelines for 3D reconstruction and enable more flexible processing of image sets. The public code release is a clear strength for reproducibility.
major comments (2)
- [Abstract] The abstract's core assertion that the permutation-equivariant design produces 'inherently robust' reconstructions 'without any reference frames' or 'post-processing alignment step' is load-bearing for the SOTA claims. Equivariance guarantees only that f(perm(X)) = perm(f(X)); it does not by itself ensure that the induced global 3D structure (poses and point maps) remains identical across permutations. The manuscript must demonstrate, via explicit experiments, that the training distribution and loss alone enforce this global consistency rather than merely local equivariance.
- [Experiments] The reported performance gains on pose estimation and dense reconstruction rest on the premise that no evaluation-time alignment is required. Without ablations that measure output consistency (e.g., relative pose error or point-map alignment error) when the same set of views is presented in different orders, it remains unclear whether the SOTA numbers reflect true order-invariance or implicit alignment during evaluation.
minor comments (1)
- The abstract would benefit from a brief statement of the precise invariance properties (affine for poses, scale for points) and how they are enforced in the loss.
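The equivariance identity quoted in the first major comment can be written out explicitly. The notation below is assumed for illustration and is not taken from the paper: X = (I_1, ..., I_N) denotes the input views and f(X) = (y_1, ..., y_N) the per-view outputs, with y_k holding the pose and local point map of view k.

```latex
% Assumed notation: sigma is a permutation of {1, ..., N} acting on the view index.
\[
  (\sigma \cdot X)_k = I_{\sigma(k)}, \qquad
  f(\sigma \cdot X)_k = f(X)_{\sigma(k)}
  \quad \text{for all } \sigma \in S_N,\; k = 1, \dots, N .
\]
```

On this reading, the identity fixes how the outputs are indexed when the inputs are reordered. Whether the y_k agree with one another as a single coherent scene is a separate, empirical question, which is what the requested experiments would measure.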
Simulated Author's Rebuttal
We thank the referee for the constructive and precise comments on the role of permutation equivariance and the importance of verifying global consistency. We address each point below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [Abstract] The abstract's core assertion that the permutation-equivariant design produces 'inherently robust' reconstructions 'without any reference frames' or 'post-processing alignment step' is load-bearing for the SOTA claims. Equivariance guarantees only that f(perm(X)) = perm(f(X)); it does not by itself ensure that the induced global 3D structure (poses and point maps) remains identical across permutations. The manuscript must demonstrate, via explicit experiments, that the training distribution and loss alone enforce this global consistency rather than merely local equivariance.
Authors: We agree that architectural equivariance by itself does not automatically guarantee identical global 3D structure across input permutations; additional inductive biases from the training objective are required. In π³ the combination of the fully permutation-equivariant backbone with losses defined on affine-invariant relative poses and scale-invariant local point maps encourages the network to recover a single consistent scene geometry (up to the declared invariances) regardless of ordering. The absence of a designated reference view during both training and inference further removes the usual source of ordering-dependent drift. To make this explicit, the revised manuscript will include new experiments that quantify output consistency—specifically, relative pose error and point-map alignment error—when the identical set of views is fed in multiple random permutations. revision: yes
-
Referee: [Experiments] The reported performance gains on pose estimation and dense reconstruction rest on the premise that no evaluation-time alignment is required. Without ablations that measure output consistency (e.g., relative pose error or point-map alignment error) when the same set of views is presented in different orders, it remains unclear whether the SOTA numbers reflect true order-invariance or implicit alignment during evaluation.
Authors: All reported numbers were obtained by feeding unordered image sets directly into the model and using the raw outputs without any post-hoc alignment, reference-frame selection, or Procrustes step. Nevertheless, we acknowledge that explicit verification of ordering invariance is necessary to fully substantiate the claims. The revised version will therefore add dedicated ablations that (i) permute the input views in multiple ways, (ii) compute pairwise relative pose errors and point-map alignment errors across those permutations, and (iii) compare against baselines that rely on a fixed reference view. These results will be reported alongside the existing benchmark tables. revision: yes
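One way the proposed permutation ablation could score pose consistency is sketched below in plain Python. It is an illustration under stated assumptions, not the authors' evaluation code: only rotations are compared (translations are omitted), all function names are hypothetical, and both runs are assumed to be already permuted back to a common view order. The sketch relies on the fact that relative rotations R_i^T R_j do not change when every pose is left-multiplied by the same global rotation, so no alignment step is needed before comparing two runs.

```python
import math


def matmul(a, b):
    """3x3 matrix product for nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]


def rot_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rot_x(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rotation_angle_deg(r):
    """Geodesic angle of a rotation matrix, in degrees."""
    cos_angle = (r[0][0] + r[1][1] + r[2][2] - 1.0) / 2.0
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def max_relative_rotation_error(rots_a, rots_b):
    """Largest angle between the pairwise relative rotations R_i^T R_j of two runs."""
    worst = 0.0
    for i in range(len(rots_a)):
        for j in range(i + 1, len(rots_a)):
            rel_a = matmul(transpose(rots_a[i]), rots_a[j])
            rel_b = matmul(transpose(rots_b[i]), rots_b[j])
            worst = max(worst, rotation_angle_deg(matmul(transpose(rel_a), rel_b)))
    return worst


# Run A: toy per-view rotations (assumption).
rots_a = [rot_z(0.1), rot_x(0.4), rot_z(-0.7)]
# Run B: same geometry expressed in a different global frame.
global_frame = rot_x(1.0)
rots_b = [matmul(global_frame, r) for r in rots_a]
# Run C: view 1 is off by a 5-degree rotation, a genuine inconsistency.
rots_c = [rots_a[0], matmul(rots_a[1], rot_z(math.radians(5.0))), rots_a[2]]

err_global = max_relative_rotation_error(rots_a, rots_b)
err_perturbed = max_relative_rotation_error(rots_a, rots_c)
print(f"frame change only: {err_global:.6f} deg, perturbed view: {err_perturbed:.6f} deg")
```

A change of global frame leaves the error near zero, while a genuine disagreement in one view shows up at its full magnitude. A complete evaluation would add the corresponding translation and point-map terms.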
Circularity Check
No circularity: novel architecture and training yield independent claims
full rationale
The paper's derivation chain consists of defining a new permutation-equivariant feed-forward network architecture, training it end-to-end on visual geometry tasks, and reporting empirical performance. No equations reduce predicted poses or point maps to quantities fitted from the authors' own prior outputs; the equivariance property is enforced by architectural construction rather than by re-labeling fitted parameters as predictions. No self-citation chain is invoked to establish uniqueness or forbid alternatives, and the central claims about reference-free consistency rest on the training distribution and loss rather than tautological re-derivation. The reported SOTA results are therefore external to the architectural definition itself.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Permutation equivariance of the network implies that reordering the input views only reorders the per-view outputs, leaving each view's prediction unchanged.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DimensionForcing alexander_duality_circle_linking [unclear]
unclear: Relation between the paper passage and the cited Recognition theorem.
π3 employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames.
-
IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel [unclear]
unclear: Relation between the paper passage and the cited Recognition theorem.
This design not only makes our model inherently robust to input ordering, but also leads to higher accuracy and performance.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 34 Pith papers
-
Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images
Cross3R performs feed-forward 3D reconstruction and 6-DoF pose estimation from any combination of satellite, UAV, and ground images, outperforming baselines on a new 278K-image tri-view dataset.
-
Holo360D: A Large-Scale Real-World Dataset with Continuous Trajectories for Advancing Panoramic 3D Reconstruction and Beyond
Holo360D is the first large-scale dataset providing continuous panoramic sequences with accurately aligned high-completeness depth maps and meshes for training panoramic 3D reconstruction models.
-
Keep It CALM: Toward Calibration-Free Kilometer-Level SLAM with Visual Geometry Foundation Models via an Assistant Eye
CAL2M achieves calibration-free kilometer-level SLAM by using an assistant eye for scale, epipolar-guided intrinsic correction, and anchor propagation for nonlinear sub-map alignment.
-
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
-
AnyImageNav: Any-View Geometry for Precise Last-Meter Image-Goal Navigation
AnyImageNav uses a semantic-to-geometric cascade with 3D multi-view foundation models to recover precise 6-DoF poses from goal images, achieving 0.27m position error and state-of-the-art success rates on Gibson and HM...
-
3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image
3D-Fixer performs in-place 3D asset completion from single-view partial point clouds via coarse-to-fine generation with ORFA conditioning, plus a new ARSG-110K dataset, to achieve higher geometric accuracy than MIDI a...
-
VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
VGGT-360 delivers geometry-consistent zero-shot panoramic depth by converting panoramas into multi-view 3D reconstructions via VGGT models and three plug-and-play correction modules, then reprojecting the result.
-
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
MapAnything is a unified feed-forward transformer that regresses metric 3D scene geometry and cameras from images using a factored representation of depth maps, ray maps, poses, and scale.
-
Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video
Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.
-
Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction
Asymmetric token reduction, with distinct merging for queries and pruning for key-values plus layer-wise adaptation, delivers up to 28x speedup on 1000-frame 3D reconstruction inputs while preserving competitive quality.
-
3D-ReGen: A Unified 3D Geometry Regeneration Framework
3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.
-
AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model
AnyRecon enables scalable 3D reconstruction from arbitrary sparse unordered views by combining video diffusion with explicit global geometric memory and retrieval to maintain consistency across large viewpoint changes.
-
Geometric Context Transformer for Streaming 3D Reconstruction
LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...
-
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...
-
Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors
The Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors outperforms prior methods on dynamic benchmarks by cutting Mean Accuracy error 13.43% and raising segmentation F-measure 10.49% via three uncerta...
-
Self-Improving 4D Perception via Self-Distillation
SelfEvo enables pretrained 4D perception models to self-improve on unlabeled videos via self-distillation, delivering up to 36.5% relative gains in video depth estimation and 20.1% in camera estimation across eight be...
-
PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing
PhyEdit improves physical accuracy in image object manipulation by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.
-
SpectralSplat: Appearance-Disentangled Feed-Forward Gaussian Splatting for Driving Scenes
SpectralSplat disentangles appearance from geometry in feed-forward 3D Gaussian Splatting by factoring color into base and adapted streams conditioned on DINOv2 embeddings, trained on paired data from a hybrid relight...
-
UniRecGen: Unifying Multi-View 3D Reconstruction and Generation
UniRecGen unifies reconstruction and generation via shared canonical space and disentangled cooperative learning to produce complete, consistent 3D models from sparse views.
-
Neural Harmonic Textures for High-Quality Primitive Based Neural Reconstruction
Neural Harmonic Textures add periodic feature interpolation and deferred neural decoding to primitive representations, achieving state-of-the-art real-time novel-view synthesis and bridging primitive and neural-field methods.
-
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...
-
TerraSky3D: Multi-View Reconstructions of European Landmarks in 4K
TerraSky3D is a new high-resolution multi-view dataset with 50,000 images in 150 scenes of European landmarks, supplied with poses and depth maps to support 3D reconstruction research.
-
Depth Anything 3: Recovering the Visual Space from Any Views
DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.
-
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...
-
WildPose: A Unified Framework for Robust Pose Estimation in the Wild
WildPose unifies feedforward 3D features from MASt3R with differentiable bundle adjustment for robust monocular pose estimation across dynamic, static, and low-ego-motion scenes.
-
Real-Scale Island Area and Coastline Estimation using Only its Place Name or Coordinates
A monocular vision system estimates real-scale island area and coastline length with around 10% error using only place name or coordinates input via automated image capture, point cloud generation, and trajectory alignment.
-
Syn4D: A Multiview Synthetic 4D Dataset
Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.
-
StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression
StreamCacheVGGT improves streaming 3D geometry reconstruction accuracy and stability under fixed memory by using cross-layer token importance scoring and hybrid cache compression instead of pure eviction.
-
MR.ScaleMaster: Scale-Consistent Collaborative Mapping from Crowd-Sourced Monocular Videos
MR.ScaleMaster adds a false-loop alarm and per-session Sim(3) scale estimation to enable accurate multi-agent monocular mapping, showing 7.2x ATE improvement on KITTI with up to 15 agents.
-
MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM
MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.
-
Rapid Forest Fuel Load Estimation via Virtual Remote Sensing and Metric-Scale Feed-Forward 3D Reconstruction
A pipeline using virtual remote sensing from Google Earth Studio, Pi-Long 3D reconstruction, metric alignment, and watershed segmentation estimates forest fuel load as a scalable alternative to traditional surveys.
-
OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL
OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.
-
HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claimin...
-
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.
Reference graph
Works this paper leans on
-
[1]
ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data
Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. arXiv preprint arXiv:2111.08897,
-
[2]
Project Aria: A New Tool for Egocentric Multi-Modal AI Research
Jakob Engel, Kiran Somasundaram, Michael Goesele, Albert Sun, Alexander Gamino, Andrew Turner, Arjang Talattof, Arnie Yuan, Bilal Souti, Brighid Meredith, et al. Project aria: A new tool for egocentric multi-modal ai research. arXiv preprint arXiv:2308.13561,
-
[3]
Grounding image matching in 3d with mast3r
Published as a conference paper at ICLR 2026
Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European Conference on Computer Vision, pp. 71–91. Springer,
-
[4]
Megadepth: Learning single-view depth prediction from internet photos
Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2041–2050,
-
[5]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193,
-
[6]
Pixelwise view selection for unstructured multi-view stereo
Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, pp. 501–518. Springer,
-
[7]
Indoor segmentation and support inference from rgbd images
Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In European conference on computer vision, pp. 746–760. Springer,
-
[8]
Mv-dust3r+: Single-stage scene reconstruction from sparse views in 2 seconds
Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexander Schwing, and Zhicheng Yan. Mv-dust3r+: Single-stage scene reconstruction from sparse views in 2 seconds. arXiv preprint arXiv:2412.06974,
-
[9]
Aether: Geometric-aware unified world modeling
Aether Team, Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, and Tong He. Aether: Geometric-aware unified world modeling. arXiv preprint arXiv:2503.18945,
-
[10]
3d reconstruction with spatial memory
Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. arXiv preprint arXiv:2408.16061, 2024.
-
[11]
Vggt: Visual geometry grounded transformer, 2025
Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. arXiv preprint arXiv:2503.11651, 2025a.
Kaixuan Wang and Shaojie Shen. Flow-motion and depth network for monocular stereo and beyond. IEEE Robotics and Automation Letters, 5(2):3307–3314,
-
[12]
MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 10510–10522, 2025b.
Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate ...
-
[13]
Tartanair: A dataset to push the limits of visual slam
Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4909–
-
[14]
Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. arXiv preprint arXiv:2501.13928,
-
[15]
doi: 10.52202/079017-0688. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/26cfdcd8fe6fd75cc53e92963a656c58-Paper-Conference.pdf.
Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo n...
-
[16]
Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825,
-
[17]
Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. arXiv preprint arXiv:2502.12138,
-
[18]
Stereo Magnification: Learning View Synthesis using Multiplane Images
Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817,
-
[19]
A.2 TRAINING DETAILS; Figure 6: Comparison of predicted pose distributions
and is then converted to a 3×3 rotation matrix via SVD orthogonalization. A.2 TRAINING DETAILS. [Figure 6: Comparison of predicted pose distributions. Panels: VGGT, π³ (ours); axes x, y, z. We visualize the predicted pose distributions in 3D space. π³ shows a clear low-dimensional structure, while VGGT’s distribution is scattered.] We train π³ in two stages, a process ...
-
[20]
Regarding optimization, we set the initial learning rate for all model components to 5×10⁻⁵. We employ a OneCycleLR scheduler, where the learning rate anneals from its maximum value down to a minimal value over the entire training duration following a cosine curve. We use the same learning rate and scheduler settings for both stages. The confidence head i...
-
[21]
Zero-shot pose estimation accuracy is evaluated on Sintel and TUM-dynamics for all methods
samples during training time. Zero-shot pose estimation accuracy is evaluated on Sintel and TUM-dynamics for all methods. A.6 ABLATION DETAILS The primary difference between our full model and the ablated models (Model 1 and Model
-
[22]
and FLARE (Zhang et al., 2025). Accordingly, in Tab. 9, we present a full set of RRA, RTA, and AUC metrics across thresholds of 1°, 3°, 5°, 10°, and 15°, evaluated on RealEstate10K. Our π³ model demonstrates robust and consistent performance even with these more demanding constraints. Table 9: Camera pose estimation with tighter angular thresholds on RealE...
-
[25]
0.115 0.083 0.023 0.010 0.052 0.023 0.020 0.009 2.834 1.409 0.564 0.377
VGGT (Wang et al., 2025a): 0.050 0.029 0.024 0.010 0.058 0.032 0.015 0.007 1.619 0.888 0.287 0.177
π³ (Ours): 0.061 0.039 0.019 0.009 0.026 0.013 0.013 0.006 1.472 0.626 0.199 0.128
Monocular depth estimation compared with Depth Anything V2 (Yang et al., 2024), one of the SOTA models for monocul...