$\pi^3$: Permutation-Equivariant Visual Geometry Learning

Chunhua Shen; Haoyi Zhu; Jiangmiao Pang; Jianjun Zhou; Junyi Chen; Tong He; Wenzheng Chang; Yang Zhou; Yifan Wang; Zizun Li

arXiv: 2507.13347 · v3 · submitted 2025-07-17 · cs.CV

Pith reviewed 2026-05-12 17:08 UTC · model grok-4.3

classification cs.CV

keywords permutation equivariance · visual geometry reconstruction · camera pose estimation · depth estimation · point maps · multi-view geometry · feedforward networks

The pith

A fully permutation-equivariant network reconstructs visual geometry without fixing a reference view.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a neural network that processes sets of images for geometry reconstruction while treating every possible ordering of the inputs identically. Previous approaches typically select one view as the anchor for all other calculations, which can lead to poor results if that view is occluded or low-quality. By building permutation equivariance directly into the model, the predictions for camera positions and local points remain consistent regardless of input sequence. This leads to more reliable outputs on tasks like estimating camera motion from video frames or building 3D maps from multiple photos.

Core claim

We introduce π³, a feed-forward neural network for visual geometry reconstruction that employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design makes the model inherently robust to input ordering and achieves higher accuracy, enabling state-of-the-art performance on camera pose estimation, monocular and video depth estimation, and dense point map reconstruction.
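To make "scale-invariant" concrete, here is a minimal sketch of comparing a predicted point set with a target after solving for a single global scale. The least-squares scale and the helper names `optimal_scale` and `scale_invariant_error` are assumptions made for illustration; the paper's own alignment procedure and loss may differ.

```python
from typing import List, Sequence

Point = Sequence[float]


def optimal_scale(pred: List[Point], target: List[Point]) -> float:
    """Least-squares scale s that minimizes sum_i ||s * pred_i - target_i||^2."""
    num = sum(p * t for pp, tp in zip(pred, target) for p, t in zip(pp, tp))
    den = sum(p * p for pp in pred for p in pp)
    if den == 0.0:
        raise ValueError("prediction is all zeros; scale is undefined")
    return num / den


def scale_invariant_error(pred: List[Point], target: List[Point]) -> float:
    """Mean Euclidean distance after aligning pred to target with one global scale."""
    s = optimal_scale(pred, target)
    dists = [
        sum((s * p - t) ** 2 for p, t in zip(pp, tp)) ** 0.5
        for pp, tp in zip(pred, target)
    ]
    return sum(dists) / len(dists)


if __name__ == "__main__":
    target = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
    half = [(0.5, 1.0, 1.5), (2.0, 2.5, 3.0)]  # same shape, half the size
    print(optimal_scale(half, target))
    print(scale_invariant_error(half, target))
```

A prediction that differs from the target only by a global scale gets an error of (numerically) zero under this measure, which is the sense in which the comparison ignores scale.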

What carries the argument

The fully permutation-equivariant architecture that enforces symmetry across input views to produce order-independent predictions of affine-invariant camera poses and scale-invariant local point maps.

If this is right

The model can process image sets in any order without changing the underlying geometry prediction.
It reaches state-of-the-art results on multiple visual geometry benchmarks without additional alignment procedures.
The same network supports both single-image depth estimation and multi-view tasks through its symmetric design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The equivariant design might naturally extend to handling dynamic scenes or objects by maintaining consistency across frames.
One could explore whether this removes the need for explicit regularization terms in the loss function for multi-view consistency.
Applying the method to extremely large unordered image collections could test its scalability beyond controlled datasets.

Load-bearing premise

That a permutation-equivariant feed-forward network will automatically produce consistent global geometry across different input orderings when trained only on typical visual datasets.

What would settle it

Run the trained model on a fixed set of images presented in two different orders and verify if the output camera poses and point maps can be aligned perfectly after permuting back; any residual differences after alignment would indicate a failure of the equivariance or consistency.
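The test described above can be sketched with toy stand-ins for the model. Both `toy_equivariant_model` and `reference_anchored_model` are hypothetical scalar functions invented for this illustration and are not part of the π³ codebase. For a real model the per-view outputs would be poses and point maps, and the comparison would follow whatever alignment the evaluation protocol allows.

```python
import itertools
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], List[float]]


def toy_equivariant_model(views: Sequence[float]) -> List[float]:
    """Hypothetical stand-in: each output depends on its own view and on a
    symmetric summary (the mean) of the whole set."""
    mean = sum(views) / len(views)
    return [v - mean for v in views]


def reference_anchored_model(views: Sequence[float]) -> List[float]:
    """Hypothetical stand-in: every output is expressed relative to the first view."""
    ref = views[0]
    return [v - ref for v in views]


def max_equivariance_gap(model: Model, views: Sequence[float]) -> float:
    """Largest deviation between model(perm(views)) and perm(model(views))."""
    base = model(views)
    worst = 0.0
    for perm in itertools.permutations(range(len(views))):
        permuted_out = model([views[i] for i in perm])
        expected = [base[i] for i in perm]  # outputs permuted the same way
        gap = max(abs(a - b) for a, b in zip(permuted_out, expected))
        worst = max(worst, gap)
    return worst


if __name__ == "__main__":
    views = [1.0, 4.0, 7.0, 10.0]
    print(max_equivariance_gap(toy_equivariant_model, views))
    print(max_equivariance_gap(reference_anchored_model, views))
```

The exhaustive loop over permutations is only practical for a handful of views; with larger sets one would sample random permutations instead.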

Original abstract

We introduce $\pi^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In contrast, $\pi^3$ employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design not only makes our model inherently robust to input ordering, but also leads to higher accuracy and performance. These advantages enable our simple and bias-free approach to achieve state-of-the-art performance on a wide range of tasks, including camera pose estimation, monocular/video depth estimation, and dense point map reconstruction. Code and models are available at https://github.com/yyfz/Pi3.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces π³, a feed-forward neural network employing a fully permutation-equivariant architecture to perform visual geometry reconstruction. It predicts affine-invariant camera poses and scale-invariant local point maps directly from unordered input views, eliminating the conventional fixed reference view. The design is claimed to confer inherent robustness to input ordering, improved accuracy, and state-of-the-art results across camera pose estimation, monocular/video depth estimation, and dense point map reconstruction, with code and models released publicly.

Significance. If the central claims are substantiated, the work would represent a meaningful advance in multi-view geometry by removing a common inductive bias (reference-frame anchoring) that can cause instability. A verified permutation-equivariant feed-forward model capable of producing consistent global geometry from arbitrary orderings could simplify pipelines for 3D reconstruction and enable more flexible processing of image sets. The public code release is a clear strength for reproducibility.

major comments (2)

[Abstract] The abstract's core assertion that the permutation-equivariant design produces 'inherently robust' reconstructions 'without any reference frames' or 'post-processing alignment step' is load-bearing for the SOTA claims. Equivariance guarantees only that f(perm(X)) = perm(f(X)); it does not by itself ensure that the induced global 3D structure (poses and point maps) remains identical across permutations. The manuscript must demonstrate, via explicit experiments, that the training distribution and loss alone enforce this global consistency rather than merely local equivariance.
[Experiments] The reported performance gains on pose estimation and dense reconstruction rest on the premise that no evaluation-time alignment is required. Without ablations that measure output consistency (e.g., relative pose error or point-map alignment error) when the same set of views is presented in different orders, it remains unclear whether the SOTA numbers reflect true order-invariance or implicit alignment during evaluation.
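The property named in the first comment can be written out next to plain invariance for contrast. The notation here (X for the tuple of input views, σ for a permutation of the view indices, g for a generic order-invariant map) is introduced for illustration and is not taken from the paper.

```latex
% X = (I_1, \dots, I_N): input views; \sigma \in S_N: a permutation of view indices.
% The permutation acts by reordering: (\sigma \cdot X)_i = I_{\sigma(i)}.
\begin{align*}
\text{Equivariance:}\quad & f(\sigma \cdot X) = \sigma \cdot f(X),
  \quad\text{i.e.}\quad f(\sigma \cdot X)_i = f(X)_{\sigma(i)}, \\
\text{Invariance:}\quad & g(\sigma \cdot X) = g(X).
\end{align*}
```

Under the first condition each view keeps its own prediction and only the order of the outputs changes. The second condition applies to quantities that carry no view index at all.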

minor comments (1)

The abstract would benefit from a brief statement of the precise invariance properties (affine for poses, scale for points) and how they are enforced in the loss.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and precise comments on the role of permutation equivariance and the importance of verifying global consistency. We address each point below and will revise the manuscript accordingly.

Referee: [Abstract] The abstract's core assertion that the permutation-equivariant design produces 'inherently robust' reconstructions 'without any reference frames' or 'post-processing alignment step' is load-bearing for the SOTA claims. Equivariance guarantees only that f(perm(X)) = perm(f(X)); it does not by itself ensure that the induced global 3D structure (poses and point maps) remains identical across permutations. The manuscript must demonstrate, via explicit experiments, that the training distribution and loss alone enforce this global consistency rather than merely local equivariance.

Authors: We agree that architectural equivariance by itself does not automatically guarantee identical global 3D structure across input permutations; additional inductive biases from the training objective are required. In π³ the combination of the fully permutation-equivariant backbone with losses defined on affine-invariant relative poses and scale-invariant local point maps encourages the network to recover a single consistent scene geometry (up to the declared invariances) regardless of ordering. The absence of a designated reference view during both training and inference further removes the usual source of ordering-dependent drift. To make this explicit, the revised manuscript will include new experiments that quantify output consistency (specifically, relative pose error and point-map alignment error) when the identical set of views is fed in multiple random permutations.

Revision: yes

Referee: [Experiments] The reported performance gains on pose estimation and dense reconstruction rest on the premise that no evaluation-time alignment is required. Without ablations that measure output consistency (e.g., relative pose error or point-map alignment error) when the same set of views is presented in different orders, it remains unclear whether the SOTA numbers reflect true order-invariance or implicit alignment during evaluation.

Authors: All reported numbers were obtained by feeding unordered image sets directly into the model and using the raw outputs without any post-hoc alignment, reference-frame selection, or Procrustes step. Nevertheless, we acknowledge that explicit verification of ordering invariance is necessary to fully substantiate the claims. The revised version will therefore add dedicated ablations that (i) permute the input views in multiple ways, (ii) compute pairwise relative pose errors and point-map alignment errors across those permutations, and (iii) compare against baselines that rely on a fixed reference view. These results will be reported alongside the existing benchmark tables.

Revision: yes
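A minimal sketch of the pairwise relative rotation error mentioned in step (ii), assuming each run yields one camera-to-world rotation per view as a 3x3 nested list and that both runs have already been put back into the same view order. The helper names are hypothetical, and translation and point-map errors are left out for brevity. Because the relative rotation R_i^T R_j is unchanged when every pose is multiplied by the same global rotation, the measure does not depend on the choice of world frame.

```python
import math
from typing import List

Mat3 = List[List[float]]


def matmul(a: Mat3, b: Mat3) -> Mat3:
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(a: Mat3) -> Mat3:
    return [[a[j][i] for j in range(3)] for i in range(3)]


def rot_z(deg: float) -> Mat3:
    """Rotation about the z axis, used only to build example inputs."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_angle_deg(r: Mat3) -> float:
    """Geodesic angle of a rotation matrix, from its trace."""
    trace = r[0][0] + r[1][1] + r[2][2]
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_theta))


def mean_pairwise_rotation_error(run_a: List[Mat3], run_b: List[Mat3]) -> float:
    """Mean angle between the relative rotations R_i^T R_j of two runs."""
    errors = []
    for i in range(len(run_a)):
        for j in range(i + 1, len(run_a)):
            rel_a = matmul(transpose(run_a[i]), run_a[j])
            rel_b = matmul(transpose(run_b[i]), run_b[j])
            errors.append(rotation_angle_deg(matmul(transpose(rel_a), rel_b)))
    return sum(errors) / len(errors)


if __name__ == "__main__":
    run_a = [rot_z(0.0), rot_z(30.0), rot_z(75.0)]
    # Same cameras expressed in a world frame rotated by 40 degrees.
    run_b = [matmul(rot_z(40.0), r) for r in run_a]
    print(mean_pairwise_rotation_error(run_a, run_b))
```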

Circularity Check

0 steps flagged

No circularity: novel architecture and training yield independent claims

The paper's derivation chain consists of defining a new permutation-equivariant feed-forward network architecture, training it end-to-end on visual geometry tasks, and reporting empirical performance. No equations reduce predicted poses or point maps to quantities fitted from the authors' own prior outputs; the equivariance property is enforced by architectural construction rather than by re-labeling fitted parameters as predictions. No self-citation chain is invoked to establish uniqueness or forbid alternatives, and the central claims about reference-free consistency rest on the training distribution and loss rather than tautological re-derivation. The reported SOTA results are therefore external to the architectural definition itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The model relies on standard neural-network training assumptions and the mathematical property of permutation equivariance; no additional free parameters, ad-hoc axioms, or invented physical entities are introduced beyond the network itself.

axioms (1)

standard math: Permutation equivariance of the network implies that reordering the inputs only reorders the per-view outputs correspondingly; each view's prediction is unchanged.
Invoked in the abstract description of the architecture.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DimensionForcing alexander_duality_circle_linking · unclear
Paper passage: π³ employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames.

IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel · unclear
Paper passage: This design not only makes our model inherently robust to input ordering, but also leads to higher accuracy and performance.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FUSER: Feed-Forward MUltiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement
cs.CV 2025-12 unverdicted novelty 8.0

FUSER is the first feed-forward multiview 3D registration transformer that jointly processes all scans to predict global poses, followed by SE(3)^N diffusion refinement for higher accuracy.
Geo-Align: Video Generation Alignment via Metric Geometry Reward
cs.CV 2026-05 unverdicted novelty 7.0

Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.
Depth2Pose: A Pose-Based Benchmark for Monocular Depth Estimation without Ground-Truth Depth
cs.CV 2026-05 unverdicted novelty 7.0

Depth2Pose is a new evaluation framework for monocular depth estimators that uses relative camera pose accuracy as a task-driven proxy and introduces the D2P dataset of challenging out-of-distribution scenes.
Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory
cs.CV 2026-05 unverdicted novelty 7.0

Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.
Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model
cs.RO 2026-05 unverdicted novelty 7.0

A feature-free monocular VINS initialization method that uses feed-forward 3D model point cloud predictions achieves over 90% success rate with under 1.2 seconds of data and performs robustly in degraded environments.
VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction
cs.CV 2026-05 unverdicted novelty 7.0

VGGT-Edit proposes a native 3D text-conditioned editing framework using depth-synchronized injection and residual field prediction, plus the DeltaScene dataset, outperforming 2D-lifting methods.
Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images
cs.CV 2026-05 unverdicted novelty 7.0

Cross3R performs feed-forward 3D reconstruction and 6-DoF pose estimation from any combination of satellite, UAV, and ground images, outperforming baselines on a new 278K-image tri-view dataset.
Holo360D: A Large-Scale Real-World Dataset with Continuous Trajectories for Advancing Panoramic 3D Reconstruction and Beyond
cs.CV 2026-04 unverdicted novelty 7.0

Holo360D is the first large-scale dataset providing continuous panoramic sequences with accurately aligned high-completeness depth maps and meshes for training panoramic 3D reconstruction models.
Keep It CALM: Toward Calibration-Free Kilometer-Level SLAM with Visual Geometry Foundation Models via an Assistant Eye
cs.RO 2026-04 unverdicted novelty 7.0

CAL2M achieves calibration-free kilometer-level SLAM by using an assistant eye for scale, epipolar-guided intrinsic correction, and anchor propagation for nonlinear sub-map alignment.
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
cs.CV 2026-04 unverdicted novelty 7.0

A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
AnyImageNav: Any-View Geometry for Precise Last-Meter Image-Goal Navigation
cs.RO 2026-04 unverdicted novelty 7.0

AnyImageNav uses a semantic-to-geometric cascade with 3D multi-view foundation models to recover precise 6-DoF poses from goal images, achieving 0.27m position error and state-of-the-art success rates on Gibson and HM...
3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image
cs.CV 2026-04 unverdicted novelty 7.0

3D-Fixer performs in-place 3D asset completion from single-view partial point clouds via coarse-to-fine generation with ORFA conditioning, plus a new ARSG-110K dataset, to achieve higher geometric accuracy than MIDI a...
VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
cs.CV 2026-03 unverdicted novelty 7.0

VGGT-360 delivers geometry-consistent zero-shot panoramic depth by converting panoramas into multi-view 3D reconstructions via VGGT models and three plug-and-play correction modules, then reprojecting the result.
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
cs.CV 2025-09 unverdicted novelty 7.0

MapAnything is a unified feed-forward transformer that regresses metric 3D scene geometry and cameras from images using a factored representation of depth maps, ray maps, poses, and scale.
FastVGGT: Training-Free Acceleration of Visual Geometry Transformer
cs.CV 2025-09 conditional novelty 7.0

FastVGGT achieves 4x speedup on VGGT for 1000-image inputs using training-free token merging tailored to 3D architectures while reducing error accumulation.
Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

Sensor2Sensor converts in-the-wild monocular dashcam videos into high-fidelity multi-modal AV sensor data using 4D Gaussian Splatting to synthesize training pairs and a diffusion model for the cross-embodiment translation.
UniT: Unified Geometry Learning with Group Autoregressive Transformer
cs.CV 2026-05 unverdicted novelty 6.0

UniT unifies online and offline 3D geometry perception via a Group Autoregressive Transformer that processes observation groups with anchor-free point map prediction and a scale-adaptive loss.
How You Move Tells What You'll Do: Trajectory-Conditioned Egocentric Prediction
cs.CV 2026-05 unverdicted novelty 6.0

TrajPilot predicts candidate future trajectories from egocentric context and uses them to condition action prediction in an embedding space, outperforming VLM and planner baselines on Ego-Exo4D, Ego4D, and other datas...
Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images
cs.CV 2026-05 unverdicted novelty 6.0

A feed-forward model aligns ground and satellite features to predict Gaussian splats for improved novel-view synthesis on georeferenced outdoor scenes.
GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.
LongDPM: Overlap-Aware 4D Reconstruction from Long Monocular Videos
cs.CV 2026-05 unverdicted novelty 6.0

LongDPM introduces an overlap-aware chunk-based framework that registers and fuses local dynamic reconstructions to achieve coherent long-range 4D geometry and tracking from monocular video.
EndoGSim: Physics-Aware 4D Dynamic Endoscopic Scene Simulations via MLLM-Guided Gaussian Splatting
cs.CV 2026-05 unverdicted novelty 6.0

EndoGSim integrates MLLM-guided material initialization with 4D Gaussian Splatting and differentiable Material Point Method to achieve physics-aware 4D reconstruction and simulation of endoscopic scenes.
Unlocking Dense Metric Depth Estimation in VLMs
cs.CV 2026-05 unverdicted novelty 6.0

DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new in...
Unlocking Dense Metric Depth Estimation in VLMs
cs.CV 2026-05 unverdicted novelty 6.0

DepthVLM attaches a depth head to VLMs for native dense metric depth prediction alongside language outputs using a two-stage unified training schedule and a new indoor-outdoor benchmark.
Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video
cs.CV 2026-05 unverdicted novelty 6.0

Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.
Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction
cs.CV 2026-05 unverdicted novelty 6.0

Spark3R achieves up to 28x speedup on 1000-frame 3D reconstruction inputs by asymmetrically reducing query and key-value tokens in Vision Transformers while keeping competitive quality.
Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction
cs.CV 2026-05 unverdicted novelty 6.0

Asymmetric token reduction, with distinct merging for queries and pruning for key-values plus layer-wise adaptation, delivers up to 28x speedup on 1000-frame 3D reconstruction inputs while preserving competitive quality.
MeshReGen: A Unified 3D Geometry Regeneration Framework
cs.CV 2026-04 unverdicted novelty 6.0

MeshReGen introduces a conditioned 3D geometry regenerator with VecSet that learns a regeneration prior via self-supervision and reports state-of-the-art results on controllable generation tasks.
MeshReGen: A Unified 3D Geometry Regeneration Framework
cs.CV 2026-04 unverdicted novelty 6.0

3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.
AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model
cs.CV 2026-04 unverdicted novelty 6.0

AnyRecon enables scalable 3D reconstruction from arbitrary sparse unordered views by combining video diffusion with explicit global geometric memory and retrieval to maintain consistency across large viewpoint changes.
Geometric Context Transformer for Streaming 3D Reconstruction
cs.CV 2026-04 unverdicted novelty 6.0

LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
cs.CV 2026-04 unverdicted novelty 6.0

The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...
Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors
cs.CV 2026-04 unverdicted novelty 6.0

The Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors outperforms prior methods on dynamic benchmarks by cutting Mean Accuracy error 13.43% and raising segmentation F-measure 10.49% via three uncerta...
Self-Improving 4D Perception via Self-Distillation
cs.CV 2026-04 unverdicted novelty 6.0

SelfEvo enables pretrained 4D perception models to self-improve on unlabeled videos via self-distillation, delivering up to 36.5% relative gains in video depth estimation and 20.1% in camera estimation across eight be...
PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing
cs.CV 2026-04 unverdicted novelty 6.0

PhyEdit improves physical accuracy in image object manipulation by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.
SpectralSplat: Appearance-Disentangled Feed-Forward Gaussian Splatting for Driving Scenes
cs.CV 2026-04 unverdicted novelty 6.0

SpectralSplat disentangles appearance from geometry in feed-forward 3D Gaussian Splatting by factoring color into base and adapted streams conditioned on DINOv2 embeddings, trained on paired data from a hybrid relight...
UniRecGen: Unifying Multi-View 3D Reconstruction and Generation
cs.CV 2026-04 unverdicted novelty 6.0

UniRecGen unifies reconstruction and generation via shared canonical space and disentangled cooperative learning to produce complete, consistent 3D models from sparse views.
Neural Harmonic Textures for High-Quality Primitive Based Neural Reconstruction
cs.CV 2026-04 unverdicted novelty 6.0

Neural Harmonic Textures add periodic feature interpolation and deferred neural decoding to primitive representations, achieving state-of-the-art real-time novel-view synthesis and bridging primitive and neural-field methods.
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
cs.CV 2026-04 unverdicted novelty 6.0

DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...
TerraSky3D: Multi-View Reconstructions of European Landmarks in 4K
cs.CV 2026-03 unverdicted novelty 6.0

TerraSky3D is a new high-resolution multi-view dataset with 50,000 images in 150 scenes of European landmarks, supplied with poses and depth maps to support 3D reconstruction research.
Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations
cs.CV 2026-03 unverdicted novelty 6.0

GR3D turns 3D scene geometry into ID-indexed text references, enabling zero-shot MLLM spatial reasoning gains of 9% on VSI-Bench and 12% on MindCube.
TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization
cs.CV 2026-03 unverdicted novelty 6.0

TrianguLang achieves state-of-the-art feed-forward text-guided 3D localization and segmentation by using predicted geometry to gate cross-view semantic correspondences without ground-truth poses.
Robo3R: Enhancing Robotic Manipulation with Accurate Feed-Forward 3D Reconstruction
cs.RO 2026-02 unverdicted novelty 6.0

Robo3R predicts accurate metric-scale 3D scene geometry from RGB images and robot states for improved robotic manipulation performance.
Depth Anything 3: Recovering the Visual Space from Any Views
cs.CV 2025-11 unverdicted novelty 6.0

DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.
HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction
cs.CV 2026-05 unverdicted novelty 5.0

HorizonStream is a long-horizon Transformer that factorizes geometric evidence influence into channel-wise linear attention for long-range temporal propagation and local spatiotemporal attention for short-range matchi...
GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation
cs.CV 2026-05 unverdicted novelty 5.0

GEM-4D is a video world model that injects 4D correspondence supervision to improve geometric consistency and robot manipulation success from 61% to 81%.
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
cs.CV 2026-05 unverdicted novelty 5.0

SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...
WildPose: A Unified Framework for Robust Pose Estimation in the Wild
cs.CV 2026-05 unverdicted novelty 5.0

WildPose unifies feedforward 3D features from MASt3R with differentiable bundle adjustment for robust monocular pose estimation across dynamic, static, and low-ego-motion scenes.
Real-Scale Island Area and Coastline Estimation using Only its Place Name or Coordinates
cs.CV 2026-05 unverdicted novelty 5.0

A monocular vision system estimates real-scale island area and coastline length with around 10% error using only place name or coordinates input via automated image capture, point cloud generation, and trajectory alignment.
Syn4D: A Multiview Synthetic 4D Dataset
cs.CV 2026-05 unverdicted novelty 5.0

Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.
StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression
cs.CV 2026-04 unverdicted novelty 5.0

StreamCacheVGGT improves streaming 3D geometry reconstruction accuracy and stability under fixed memory by using cross-layer token importance scoring and hybrid cache compression instead of pure eviction.
MR.ScaleMaster: Scale-Consistent Collaborative Mapping from Crowd-Sourced Monocular Videos
cs.RO 2026-04 unverdicted novelty 5.0

MR.ScaleMaster adds a false-loop alarm and per-session Sim(3) scale estimation to enable accurate multi-agent monocular mapping, showing 7.2x ATE improvement on KITTI with up to 15 agents.
MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM
cs.RO 2026-04 unverdicted novelty 5.0

MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.
MetroGS: Efficient and Stable Reconstruction of Geometrically Accurate High-Fidelity Large-Scale Scenes
cs.CV 2025-11 unverdicted novelty 5.0

MetroGS combines distributed 2D Gaussian Splatting with structured dense enhancement, progressive hybrid optimization, and depth-guided appearance modeling to deliver higher geometric accuracy and stability in large-s...
Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers
cs.CV 2025-09 unverdicted novelty 5.0

Block-sparse global attention accelerates multi-view reconstruction transformers by over 3x by exploiting concentrated attention on cross-view correspondences.
ViPE: Video Pose Engine for 3D Geometric Perception
cs.CV 2025-08 unverdicted novelty 5.0

ViPE estimates camera intrinsics, motion, and dense near-metric depth from uncalibrated videos, outperforming baselines on TUM and KITTI while releasing annotations for 96M frames across real and generated videos.
ShowMak3r: Compositional TV Show Reconstruction
cs.CV 2025-04 unverdicted novelty 5.0

ShowMak3r reconstructs dynamic TV show scenes from video using 3D actor localization, shot matching, and expression fitting to enable new camera views and scene edits.
Rapid Forest Fuel Load Estimation via Virtual Remote Sensing and Metric-Scale Feed-Forward 3D Reconstruction
cs.CV 2026-05 unverdicted novelty 4.0

A pipeline using virtual remote sensing from Google Earth Studio, Pi-Long 3D reconstruction, metric alignment, and watershed segmentation estimates forest fuel load as a scalable alternative to traditional surveys.
OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL
cs.RO 2026-04 unverdicted novelty 4.0

OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.
HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
cs.CV 2026-04 unverdicted novelty 4.0

HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claimin...

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 58 Pith papers · 7 internal anchors

[1]

ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. arXiv preprint arXiv:2111.08897,

work page internal anchor Pith review arXiv
[2]

Project Aria: A New Tool for Egocentric Multi-Modal AI Research

Jakob Engel, Kiran Somasundaram, Michael Goesele, Albert Sun, Alexander Gamino, Andrew Turner, Arjang Talattof, Arnie Yuan, Bilal Souti, Brighid Meredith, et al. Project aria: A new tool for egocentric multi-modal ai research. arXiv preprint arXiv:2308.13561,

work page internal anchor Pith review arXiv
[3]

Grounding image matching in 3d with mast3r

Published as a conference paper at ICLR 2026

Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European Conference on Computer Vision, pp. 71–91. Springer,

work page 2026
[4]

Megadepth: Learning single-view depth prediction from internet photos

Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2041–2050,

work page 2041
[5]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Pixelwise view selection for unstructured multi-view stereo

Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, pp. 501–518. Springer,

work page 2016
[7]

Indoor segmentation and support inference from rgbd images

Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In European conference on computer vision, pp. 746–760. Springer,

work page 2026
[8]

Mv-dust3r+: Single-stage scene reconstruction from sparse views in 2 seconds

Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexander Schwing, and Zhicheng Yan. Mv-dust3r+: Single-stage scene reconstruction from sparse views in 2 seconds. arXiv preprint arXiv:2412.06974,

work page arXiv
[9]

Aether: Geometric-aware unified world modeling

Aether Team, Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, and Tong He. Aether: Geometric-aware unified world modeling. arXiv preprint arXiv:2503.18945,

work page arXiv
[10]

3D Reconstruction with Spatial Memory

Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. arXiv preprint arXiv:2408.16061,

work page internal anchor Pith review arXiv
[11]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. arXiv preprint arXiv:2503.11651, 2025a.

Kaixuan Wang and Shaojie Shen. Flow-motion and depth network for monocular stereo and beyond. IEEE Robotics and Automation Letters, 5(2):3307–3314,

work page arXiv
[12]

MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 10510–10522, 2025b.

Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate ...

work page internal anchor Pith review arXiv
[13]

Tartanair: A dataset to push the limits of visual slam

Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4909–

work page 2020
[14]

Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass

Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. arXiv preprint arXiv:2501.13928,

work page arXiv
[15]

URL https://proceedings.neurips.cc/paper_files/paper/2024/file/26cfdcd8fe6fd75cc53e92963a656c58-Paper-Conference.pdf

doi: 10.52202/079017-0688. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/26cfdcd8fe6fd75cc53e92963a656c58-Paper-Conference.pdf.

Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo n...

work page 2024
[16]

MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion

Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825,

work page internal anchor Pith review arXiv
[17]

Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views

Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. arXiv preprint arXiv:2502.12138,

work page arXiv
[18]

Stereo Magnification: Learning View Synthesis using Multiplane Images

Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817,

work page internal anchor Pith review arXiv
[19]

A.2 TRAINING DETAILS. Figure 6: Comparison of predicted pose distributions

and is then converted to a 3×3 rotation matrix via SVD orthogonalization.

A.2 TRAINING DETAILS

[Figure 6: Comparison of predicted pose distributions. Panels: VGGT, π³ (ours); axes x, y, z.] We visualize the predicted pose distributions in 3D space. π³ shows a clear low-dimensional structure, while VGGT’s distribution is scattered.

We train π³ in two stages, a process ...

work page 2024
[20]

cold start

Regarding optimization, we set the initial learning rate for all model components to 5×10⁻⁵. We employ a OneCycleLR scheduler, where the learning rate anneals from its maximum value down to a minimal value over the entire training duration following a cosine curve. We use the same learning rate and scheduler settings for both stages. The confidence head i...

work page 2026
[21]

Zero-shot pose estimation accuracy is evaluated on Sintel and TUM-dynamics for all methods

samples during training time. Zero-shot pose estimation accuracy is evaluated on Sintel and TUM-dynamics for all methods.

A.6 ABLATION DETAILS

The primary difference between our full model and the ablated models (Model 1 and Model

work page 2026
[22]

Accordingly, in Tab

and FLARE (Zhang et al., 2025). Accordingly, in Tab. 9, we present a full set of RRA, RTA, and AUC metrics across thresholds of 1°, 3°, 5°, 10°, and 15°, evaluated on RealEstate10K. Our π³ model demonstrates robust and consistent performance even with these more demanding constraints. Table 9: Camera pose estimation with tighter angular thresholds on RealE...

work page 2025
[23]

54.30 87.24 94.78 97.90 98.46 5.47 24.56 39.23 59.29 69.11 3.77 13.67 22.36 37.33 46.71
CUT3R (Wang et al., 2025b) 78.63 96.06 98.15 99.31 99.63 16.23 51.43 67.44 82.98 88.93 13.40 33.39 45.63 61.78 70.15
FLARE (Zhang et al.,

work page arXiv
[24]

0.150 0.111 0.048 0.018 0.150 0.097 0.061 0.024 3.134 1.476 0.875 0.646
CUT3R (Wang et al., 2025b) 0.097 0.049 0.025 0.009 0.091 0.036 0.065 0.025 4.021 1.886 0.684 0.551
FLARE (Zhang et al.,

work page arXiv
[25]

We evaluate it on our standard benchmarks with input resolution 518, following CUT3R (Wang et al., 2025b) protocol

0.115 0.083 0.023 0.010 0.052 0.023 0.020 0.009 2.834 1.409 0.564 0.377
VGGT (Wang et al., 2025a) 0.050 0.029 0.024 0.010 0.058 0.032 0.015 0.007 1.619 0.888 0.287 0.177
π³ (Ours) 0.061 0.039 0.019 0.009 0.026 0.013 0.013 0.006 1.472 0.626 0.199 0.128

Monocular depth estimation compared with Depth Anything V2 (Yang et al., 2024), one of the SOTA models for monocul...

work page arXiv 2024
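
The training-details excerpt in entry [20] describes a learning rate of 5×10⁻⁵ that decays to an unspecified minimum along a cosine curve over the whole run. A minimal sketch of that decay in plain Python is given here. It is not the authors' code: the function name `cosine_anneal`, the `total_steps` argument, and the default `lr_min = 0.0` are illustrative assumptions, 5×10⁻⁵ is assumed to be the peak value, and any warm-up phase a OneCycleLR scheduler may apply is left out.

```python
import math


def cosine_anneal(step: int, total_steps: int,
                  lr_max: float = 5e-5, lr_min: float = 0.0) -> float:
    """Learning rate at `step`, decaying from lr_max to lr_min on a half cosine.

    lr_max = 5e-5 mirrors the rate quoted in the excerpt (assumed to be the peak).
    lr_min = 0.0 is a placeholder: the source only says "a minimal value".
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that out-of-range steps return the endpoint values.
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the rate at a few points of a hypothetical 1000-step run.
    for s in (0, 250, 500, 750, 1000):
        print(s, cosine_anneal(s, total_steps=1000))
```

With `total_steps=1000` the function returns `lr_max` at step 0, the midpoint between `lr_max` and `lr_min` at step 500, and `lr_min` at step 1000.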
