GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

Hanrui Cheng; Junda Cheng; Longliang Liu; Wenjing Liao; Xin Yang; Yuecheng Liu; Yuzhou Wang

arxiv: 2605.10525 · v4 · pith:UXKCP4VWnew · submitted 2026-05-11 · 💻 cs.CV

Yuecheng Liu, Junda Cheng, Longliang Liu, Wenjing Liao, Hanrui Cheng, Yuzhou Wang, Xin Yang

Pith reviewed 2026-05-20 22:56 UTC · model grok-4.3

keywords: video depth estimation · 3D geometric consistency · camera pose prediction · geometric embeddings · spatio-temporal transformer · dynamic scenes · monocular depth

The pith

GemDepth embeds predicted camera poses to enforce strict 3D consistency in video depth estimation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that existing video depth methods rely on transformer-based temporal smoothing and therefore lose geometric consistency when the camera rotates or the viewpoint shifts sharply. GemDepth counters this by inserting a Geometry-Embedding Module that first predicts inter-frame camera poses and then turns those poses into implicit geometric embeddings. These embeddings are fed to an Alternating Spatio-Temporal Transformer that finds point-level correspondences across frames. The result is sharper spatial detail together with stronger temporal coherence, all trained with a data-efficient strategy. If correct, the method supplies the missing 3D awareness that current smoothing-only pipelines lack.

Core claim

GemDepth introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings. These embeddings supply intrinsic 3D perception and alignment. Guided by the embeddings, an Alternating Spatio-Temporal Transformer (ASTT) captures latent point-level correspondences that simultaneously sharpen fine details and enforce rigorous temporal consistency. The framework is trained with a data-efficient strategy and reports state-of-the-art results on multiple datasets, especially in complex dynamic scenes.
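
The alternation described above can be illustrated with a toy sketch. This is not the authors' ASTT: it assumes identity query/key/value projections, a single head, and a per-frame embedding injected by simple addition, none of which the source specifies. The names `alternating_attention` and `pose_embeddings` are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Dot-product self-attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in tokens])
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(dim)])
    return out

def alternating_attention(video, pose_embeddings, num_blocks=2):
    """video: [T][N][C] feature tokens; pose_embeddings: [T][C], one vector per frame.

    Both the additive injection and the block count are assumptions for illustration.
    """
    T, N = len(video), len(video[0])
    # Add the frame's embedding to every token of that frame.
    x = [[[f + e for f, e in zip(tok, pose_embeddings[t])] for tok in video[t]]
         for t in range(T)]
    for _ in range(num_blocks):
        # Spatial step: tokens within the same frame attend to each other.
        x = [self_attention(frame) for frame in x]
        # Temporal step: the same token position attends across frames.
        for n in range(N):
            track = self_attention([x[t][n] for t in range(T)])
            for t in range(T):
                x[t][n] = track[t]
    return x
```

The output keeps the input layout `[T][N][C]`, so spatial and temporal steps can be stacked freely. A real implementation would use learned projections and multi-head attention from a deep learning framework.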

What carries the argument

The Geometry-Embedding Module (GEM) that predicts inter-frame camera poses and converts them into implicit geometric embeddings for 3D perception and alignment.
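
For readers who want the underlying geometry, the standard relation between inter-frame pose and depth can be written as follows. The source does not give these equations. The notation is an assumption: $T_i \in SE(3)$ is the camera-to-world pose of frame $i$, $K$ the intrinsic matrix, $\tilde{p}_i$ a homogeneous pixel, and $d_i$ its depth.

```latex
% Relative transform taking points from camera i to camera j
T_{i \to j} = T_j^{-1} T_i =
\begin{bmatrix} R_{i \to j} & t_{i \to j} \\ 0^\top & 1 \end{bmatrix}

% Unproject a pixel of frame i, move it into frame j, and reproject
X_i = d_i \, K^{-1} \tilde{p}_i, \qquad
X_j = R_{i \to j} X_i + t_{i \to j}, \qquad
\tilde{p}_j \sim K X_j

% 3D consistency of a static point: the depth predicted at p_j
% should match the z-component of the transformed point
d_j(p_j) \approx [X_j]_z
```

How GEM turns a predicted $T_{i \to j}$ into an implicit embedding is not described in the material reviewed here.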

If this is right

Spatial blurring in fine-detail regions is reduced because point-level correspondences are recovered with geometric guidance.
Temporal inconsistencies that arise under rotations and drastic view changes are eliminated once motion priors are injected.
State-of-the-art depth accuracy holds across multiple public datasets, particularly in dynamic scenes.
A data-efficient training schedule maintains high performance without requiring massive additional labeled video.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same geometric embeddings could be reused as input features for downstream tasks such as video object segmentation or novel-view synthesis.
Evaluating the model on camera trajectories with extreme angular velocity not present in current benchmarks would test how far the 3D consistency extends.
Real-time robotics or AR pipelines that already estimate camera pose could adopt the embeddings with little extra cost.

Load-bearing premise

Video depth networks achieve strict 3D geometric consistency only when they receive explicit awareness of camera motion and global 3D structure.

What would settle it

On a test set of video sequences containing known large rotations or sudden viewpoint changes, remove the Geometry-Embedding Module and measure whether 3D consistency metrics drop significantly compared with the full model.
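
One way such a 3D consistency metric could be computed is sketched below: warp the depth of frame i into frame j with a known relative pose and compare it with the depth predicted for frame j. This is a hedged illustration, not the paper's evaluation protocol. The function name `reprojection_depth_error`, the nearest-neighbour lookup, and the pinhole intrinsics tuple are assumptions, and the check is only meaningful for static scene points.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with depth -> 3D point in the camera frame (pinhole model)."""
    return [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]

def transform(R, t, X):
    """Apply a rigid transform X' = R X + t, with R a 3x3 nested list."""
    return [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]

def project(X, fx, fy, cx, cy):
    """3D point in the camera frame -> pixel coordinates."""
    return fx * X[0] / X[2] + cx, fy * X[1] / X[2] + cy

def reprojection_depth_error(depth_i, depth_j, R, t, intrinsics):
    """Mean relative gap between frame-i depth warped into frame j and depth_j.

    depth_i, depth_j: 2D lists [H][W]; (R, t) maps frame-i camera
    coordinates to frame-j camera coordinates.
    """
    fx, fy, cx, cy = intrinsics
    H, W = len(depth_j), len(depth_j[0])
    errors = []
    for v in range(len(depth_i)):
        for u in range(len(depth_i[0])):
            X_j = transform(R, t, unproject(u, v, depth_i[v][u], fx, fy, cx, cy))
            if X_j[2] <= 0:
                continue  # point falls behind camera j
            uj, vj = project(X_j, fx, fy, cx, cy)
            ui, vi = round(uj), round(vj)  # nearest-neighbour lookup
            if 0 <= ui < W and 0 <= vi < H:
                errors.append(abs(depth_j[vi][ui] - X_j[2]) / X_j[2])
    return sum(errors) / len(errors) if errors else float("nan")
```

Running this with and without the module under test, on sequences with large rotations, would give the comparison described above. Occlusion handling and masking of moving objects are omitted here.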

Figures

Figures reproduced from arXiv: 2605.10525 by Hanrui Cheng, Junda Cheng, Longliang Liu, Wenjing Liao, Xin Yang, Yuecheng Liu, Yuzhou Wang.

**Figure 1.** Qualitative comparison on the KITTI (Geiger et al., 2013) dataset. Compared with the state-of-the-art VideoDepthAnything (VDA) (Chen et al., 2025), our GemDepth-VDA demonstrates superior capability in resolving intricate background details and preserving fine structures. Left: full-resolution video depth sequences; right: zoomed-in views.

**Figure 2.** Leaderboard performance. The radar chart illustrates that GemDepth ranks first across all four benchmark video datasets, significantly advancing the state-of-the-art. The reported metric is the AbsRel error.

**Figure 3.** Visualization of zero-shot point clouds on KITTI. Accumulated from 10 consecutive frames using GT poses, GemDepth demonstrates superior 3D temporal consistency, maintaining structural integrity even under camera rotation and large ego-motion.

**Figure 4.** Overview of GemDepth-DAv2. Built upon the ViT-based encoder and DPT head of DepthAnythingV2 (Yang et al., 2024b), GemDepth-DAv2 incorporates two novel components: Geometry-Embedding Module (GEM) and Alternating Spatio-Temporal Transformer (ASTT). By synergistically aggregating 3D geometric constraints and multi-scale spatio-temporal interactions, GemDepth-DAv2 effectively addresses the long-standing incons…

**Figure 5.** Qualitative results of temporal consistency on videos of varying lengths. We compared GemDepth-VDA with DepthAnythingV2 (Yang et al., 2024b) and VideoDepthAnything (Chen et al., 2025) using sequences of increasing lengths from Sintel (Butler et al., 2012), DAVIS (Perazzi et al., 2016), KITTI (Geiger et al., 2013), and in-the-wild datasets. The red boxes highlight the temporally inconsistent depth estimati…

**Figure 6.** Qualitative comparison of spatial accuracy on Sintel (Butler et al., 2012), Bonn (Palazzolo et al., 2019), and ScanNet (Dai et al., 2017). As indicated by the white arrows, GemDepth-VDA outperforms existing approaches in recovering background depth and preserving fine structural details.

**Figure 8.** Placement of ASTT across different feature processing stages.

**Figure 9.** More qualitative comparison on the KITTI (Geiger et al., 2013) dataset.
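
Figure 2 reports the AbsRel error, the mean of |prediction − ground truth| / ground truth over valid pixels. A minimal sketch follows. The function name `abs_rel` and the `min_depth` validity threshold are assumptions, and any scale or shift alignment that an evaluation protocol might apply beforehand is deliberately left out.

```python
def abs_rel(pred, gt, min_depth=1e-3):
    """Absolute relative error over pixels whose ground-truth depth is valid.

    pred, gt: flat sequences of depth values of equal length.
    min_depth: hypothetical validity threshold; pixels with gt <= min_depth are skipped.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > min_depth]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)
```

For example, predictions `[1.0, 2.0, 4.0]` against ground truth `[1.0, 4.0, 4.0]` give per-pixel errors 0, 0.5, and 0, so the metric is 0.5 / 3. Lower is better.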

Original abstract

Video depth estimation extends monocular prediction into the temporal domain to ensure coherence. However, existing methods often suffer from spatial blurring in fine-detail regions and temporal inconsistencies. We argue that current approaches, which primarily rely on temporal smoothing via Transformers, struggle to maintain strict 3D geometric consistency, particularly under rotations or drastic view changes. To address this, we propose GemDepth, a framework built on the insight that an explicit awareness of camera motion and global 3D structure is a prerequisite for 3D consistency. Distinctively, GemDepth introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings. This injection of motion priors equips the network with intrinsic 3D perception and alignment capabilities. Guided by these geometric cues, our Alternating Spatio-Temporal Transformer (ASTT) captures latent point-level correspondences to simultaneously enhance spatial precision for sharp details and enforce rigorous temporal consistency. Furthermore, GemDepth employs a data-efficient training strategy, effectively bridging the gap between high efficiency and robust geometric consistency. As shown in Fig. 2, comprehensive evaluations demonstrate that GemDepth achieves state-of-the-art performance across multiple datasets, particularly in complex dynamic scenarios. The code is publicly available at: https://github.com/Yuecheng919/GemDepth.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GemDepth adds pose-derived geometric embeddings and alternating attention to video depth, but its effectiveness in dynamic scenes hinges on reliable pose prediction.

GemDepth injects explicit geometric information, derived from predicted camera poses, into the depth network through a dedicated module, and pairs it with an alternating attention mechanism to improve both spatial sharpness and temporal coherence in video depth estimation. The Geometry-Embedding Module (GEM) predicts inter-frame poses and turns them into implicit geometric embeddings. These guide the Alternating Spatio-Temporal Transformer (ASTT), which alternates between spatial detail and temporal relations. The authors position this as a route to better 3D consistency than pure temporal smoothing, especially in scenes with rotations or large view changes. They also describe a data-efficient training strategy, report state-of-the-art results on several datasets with particular strength in complex dynamic scenarios, and release the code.

The approach has merit: it addresses the geometric consistency problem head-on instead of relying solely on learned temporal patterns. Turning poses into embeddings and alternating the attention types also sets the design apart from standard video depth models.

On the downside, the stress test raises a fair point about pose reliability. Predicting accurate camera poses from monocular video is hard in dynamic scenes with moving objects, because most pose estimation methods assume the scene is mostly static. Pose errors could weaken the geometric embeddings and, with them, the consistency gains. Without detailed ablations or results broken down by scene type, it is unclear how robust the method is. The abstract is light on specific metrics, so the full paper needs to demonstrate that the improvements are substantial and do not simply come from a stronger overall architecture.

This paper would interest people in computer vision working on monocular depth and video processing, particularly those targeting augmented reality or robotic navigation, where consistent depth over time matters. It has a clear technical proposal and engages with a relevant limitation of existing methods, so it should go to peer review for a proper evaluation of the claims and experiments.

Referee Report

2 major / 2 minor

Summary. The paper proposes GemDepth, a video depth estimation framework that introduces a Geometry-Embedding Module (GEM) to predict inter-frame camera poses and generate implicit geometric embeddings. These embeddings guide an Alternating Spatio-Temporal Transformer (ASTT) to capture latent point-level correspondences, aiming to improve spatial precision for fine details and enforce rigorous temporal consistency. The method employs a data-efficient training strategy and claims state-of-the-art performance across multiple datasets, particularly in complex dynamic scenarios.

Significance. If the results hold, the explicit incorporation of camera motion priors for intrinsic 3D perception represents a meaningful architectural shift from purely transformer-based temporal smoothing. This could improve robustness under rotations and drastic view changes. Public code availability aids reproducibility and allows direct verification of the geometric consistency claims.

major comments (2)

[GEM description and experiments] The central claim of strict 3D geometric consistency in complex dynamic scenarios rests on the reliability of pose predictions from the Geometry-Embedding Module (GEM). However, moving objects violate the static-scene assumption underlying monocular pose estimation, which could render the injected motion priors ineffective. The manuscript should include quantitative pose error metrics (rotation/translation) and ablations isolating GEM's contribution on dynamic subsets of the evaluated datasets.
[Evaluation and results] The abstract asserts SOTA results and highlights performance in dynamic scenarios, yet supplies no quantitative metrics, error bars, dataset details, or ablation tables. Without these in the full manuscript (e.g., in the evaluation section or Table 1/2), it is impossible to assess whether the data support the geometric-consistency argument over standard depth metrics.
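
The pose error metrics requested in the first major comment are commonly computed as a geodesic rotation angle and a Euclidean translation distance. The sketch below assumes 3x3 rotation matrices as nested lists and translations as 3-vectors. The function names are hypothetical, and because monocular translation is usually recovered only up to scale, a real evaluation would likely align scale first.

```python
import math

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle between two rotation matrices, in degrees."""
    # trace(R_pred^T R_gt) equals the sum of elementwise products.
    trace = sum(R_pred[r][c] * R_gt[r][c] for r in range(3) for c in range(3))
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def translation_error(t_pred, t_gt):
    """Euclidean distance between predicted and ground-truth translations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_pred, t_gt)))
```

Reporting both numbers separately on static and dynamic subsets would show whether moving objects degrade the poses that GEM relies on.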

minor comments (2)

[Figure 2] Figure 2 is referenced for comprehensive evaluations but lacks a clear caption or legend explaining the visualized consistency metrics.
[Method overview] Notation for the implicit geometric embeddings and how they are injected into ASTT could be formalized with an equation or diagram for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and describe the revisions planned for the next version of the manuscript.

Referee: [GEM description and experiments] The central claim of strict 3D geometric consistency in complex dynamic scenarios rests on the reliability of pose predictions from the Geometry-Embedding Module (GEM). However, moving objects violate the static-scene assumption underlying monocular pose estimation, which could render the injected motion priors ineffective. The manuscript should include quantitative pose error metrics (rotation/translation) and ablations isolating GEM's contribution on dynamic subsets of the evaluated datasets.

Authors: We agree that explicit validation of the pose predictions is important, particularly under dynamic conditions where the static-scene assumption may be violated. Although GEM is trained end-to-end to produce geometric embeddings that support 3D consistency, we will add quantitative pose error metrics (rotation and translation errors) on the evaluated datasets. We will also include new ablations that isolate GEM's contribution specifically on dynamic subsets to directly address this concern.

Revision: yes

Referee: [Evaluation and results] The abstract asserts SOTA results and highlights performance in dynamic scenarios, yet supplies no quantitative metrics, error bars, dataset details, or ablation tables. Without these in the full manuscript (e.g., in the evaluation section or Table 1/2), it is impossible to assess whether the data support the geometric-consistency argument over standard depth metrics.

Authors: The full manuscript already reports quantitative results, including standard depth metrics, error bars, dataset specifications, and ablation studies in the evaluation section and associated tables. To strengthen the link between these results and the geometric-consistency claims, we will expand the discussion of dynamic scenarios, add explicit references to the relevant tables in the abstract and introduction, and include additional breakdowns on dynamic subsets.

Revision: partial

Circularity Check

0 steps flagged

No circularity: architectural design and evaluation remain independent of inputs

The paper advances GemDepth via an explicit design insight that camera-motion awareness (via GEM pose prediction) is a prerequisite for 3D consistency, implemented in GEM and ASTT modules, then validated on external datasets. No equation, parameter fit, or self-citation reduces the claimed consistency to a quantity chosen on the same data or to a prior result by the same authors. The derivation chain from insight to modules to reported metrics is self-contained and does not collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented physical entities are stated. The two new modules (GEM and ASTT) and the data-efficient training strategy are presented as engineering contributions rather than new theoretical entities.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings... explicit awareness of camera motion and global 3D structure is a prerequisite for 3D consistency

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

