pith. machine review for the scientific record.

arxiv: 2605.10525 · v3 · submitted 2026-05-11 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords video depth estimation · 3D geometric consistency · camera pose prediction · spatio-temporal transformer · dynamic scenes · monocular depth

The pith

GemDepth achieves 3D-consistent video depth by predicting camera poses to embed geometric structure into a transformer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that video depth estimation needs explicit awareness of camera motion and global 3D structure to maintain strict geometric consistency, especially under rotations and large view changes. Existing methods that rely mainly on temporal smoothing fail at this because they lack intrinsic 3D alignment. GemDepth addresses the gap with a Geometry-Embedding Module that predicts inter-frame poses and injects the resulting embeddings, which then guide an Alternating Spatio-Temporal Transformer to recover sharp point-level correspondences across space and time. The result is higher spatial detail and better consistency on dynamic scenes, achieved with relatively efficient training.
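The pose-to-embedding pattern described above can be sketched in a few lines of numpy. This is an illustrative stand-in, not the paper's architecture: the projection weights, feature shapes, and the simple additive injection are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def pose_to_embedding(rel_pose, w):
    """Flatten a 4x4 relative camera pose and project it to the feature
    dimension with a stand-in linear map w (a learned layer in practice)."""
    return rel_pose.reshape(-1) @ w  # (16,) @ (16, d) -> (d,)

# Toy setup: 4 frames of 8x8 feature maps with d = 32 channels.
d = 32
feats = rng.normal(size=(4, 8, 8, d))    # per-frame backbone features
poses = np.tile(np.eye(4), (4, 1, 1))    # predicted frame-to-frame poses
w = 0.01 * rng.normal(size=(16, d))      # hypothetical projection weights

# One geometric embedding per frame, broadcast over spatial positions,
# so every token the downstream transformer sees carries a motion prior.
emb = np.stack([pose_to_embedding(p, w) for p in poses])  # (4, d)
conditioned = feats + emb[:, None, None, :]               # (4, 8, 8, 32)
```

How the real GEM computes and fuses its embeddings is not specified in the material above; the sketch only shows the shape of the idea, namely that a per-frame motion signal is turned into a vector and injected into every spatial token.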

Core claim

GemDepth shows that an explicit Geometry-Embedding Module, which predicts inter-frame camera poses to produce implicit geometric embeddings, supplies the network with intrinsic 3D perception; these cues let the Alternating Spatio-Temporal Transformer capture latent point-level correspondences that simultaneously sharpen fine spatial details and enforce rigorous temporal consistency under camera motion.

What carries the argument

Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings injected into the network.

If this is right

  • Spatial details remain sharp in fine regions instead of blurring.
  • Temporal consistency holds rigorously even when the camera rotates or changes viewpoint drastically.
  • Performance reaches state-of-the-art levels on multiple datasets, especially complex dynamic scenes.
  • Training remains data-efficient while still delivering robust geometric consistency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pose-prediction-plus-embedding pattern could be tested on related tasks such as video object tracking or novel-view synthesis to check whether explicit motion geometry transfers.
  • If the geometric embeddings prove sufficient, future video depth systems might reduce reliance on heavy post-processing smoothing.
  • Evaluating the method on synthetic sequences with perfectly known ground-truth poses would isolate how much the embeddings contribute versus other network components.

Load-bearing premise

Accurate prediction of inter-frame camera poses is feasible and directly supplies the geometric embeddings required for strict 3D consistency.

What would settle it

Depth maps from the model that still exhibit clear temporal inconsistencies on sequences with known large rotations or viewpoint changes would falsify the central claim.
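That falsification test can be made concrete with a small reprojection check: back-project a depth map to 3D, move the points by a known relative pose, and compare the resulting depths against the model's prediction for the next frame. The intrinsics, the pure forward translation, and the fronto-parallel plane below are toy assumptions, not the paper's protocol.

```python
import numpy as np

def reproject_depth(depth, K, T):
    """Back-project a depth map with intrinsics K, move the 3D points by the
    relative pose T (frame t -> t+1), and return their depths in the new frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N
    rays = np.linalg.inv(K) @ pix                    # unit-depth rays
    pts = rays * depth.reshape(-1)                   # 3D points in frame t
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])
    return (T @ pts_h)[2].reshape(h, w)              # z in frame t+1

K = np.array([[50.0, 0.0, 16.0], [0.0, 50.0, 16.0], [0.0, 0.0, 1.0]])
depth_t = np.full((32, 32), 5.0)   # fronto-parallel plane at depth 5

T = np.eye(4)
T[2, 3] = -1.0                     # camera moves 1 unit forward

depth_expected = reproject_depth(depth_t, K, T)
# A geometrically consistent model should predict ~4.0 everywhere at t+1;
# systematic deviations on known-pose sequences are the failure mode above.
```

Large gaps between `depth_expected` and the model's actual prediction for the next frame, on sequences with known large rotations or viewpoint changes, would be exactly the evidence that settles the claim.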

Figures

Figures reproduced from arXiv: 2605.10525 by Hanrui Cheng, Longliang Liu, Wenjing Liao, Xin Yang, Yuecheng Liu, Junda Cheng, Yuzhou Wang.

Figure 1. Qualitative comparison on the KITTI (Geiger et al., 2013) dataset. Compared with the state-of-the-art VideoDepthAnything (VDA) (Chen et al., 2025), our GemDepth-VDA demonstrates superior capability in resolving intricate background details and preserving fine structures. Left: full-resolution video depth sequences; Right: zoomed-in views.

Figure 2. Leaderboard performance. The radar chart illustrates that GemDepth ranks first across all four benchmark video datasets, significantly advancing the state-of-the-art. The reported metric is the AbsRel error.

Figure 3. Visualization of zero-shot point clouds on KITTI. Accumulated from 10 consecutive frames using GT poses, GemDepth demonstrates superior 3D temporal consistency, maintaining structural integrity even under camera rotation and large ego-motion.

Figure 4. Overview of GemDepth-DAv2. Built upon the ViT-based encoder and DPT head of DepthAnythingV2 (Yang et al., 2024b), GemDepth-DAv2 incorporates two novel components: Geometry-Embedding Module (GEM) and Alternating Spatio-Temporal Transformer (ASTT). By synergistically aggregating 3D geometric constraints and multi-scale spatio-temporal interactions, GemDepth-DAv2 effectively addresses the long-standing incons…

Figure 5. Qualitative results of temporal consistency on videos of varying lengths. We compared GemDepth-VDA with DepthAnythingV2 (Yang et al., 2024b) and VideoDepthAnything (Chen et al., 2025) using sequences of increasing lengths from Sintel (Butler et al., 2012), DAVIS (Perazzi et al., 2016), KITTI (Geiger et al., 2013), and in-the-wild datasets. The red boxes highlight the temporally inconsistent depth estimati…

Figure 6. Qualitative comparison of spatial accuracy on Sintel (Butler et al., 2012), Bonn (Palazzolo et al., 2019), and Scannet (Dai et al., 2017). As indicated by the white arrows, GemDepth-VDA outperforms existing approaches in recovering background depth and preserving fine structural details.

Figure 8. Placement of ASTT across different feature processing stages.

Figure 9. More qualitative comparison on the KITTI (Geiger et al., 2013) dataset.
read the original abstract

Video depth estimation extends monocular prediction into the temporal domain to ensure coherence. However, existing methods often suffer from spatial blurring in fine-detail regions and temporal inconsistencies. We argue that current approaches, which primarily rely on temporal smoothing via Transformers, struggle to maintain strict 3D geometric consistency, particularly under rotations or drastic view changes. To address this, we propose GemDepth, a framework built on the insight that an explicit awareness of camera motion and global 3D structure is a prerequisite for 3D consistency. Distinctively, GemDepth introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings. This injection of motion priors equips the network with intrinsic 3D perception and alignment capabilities. Guided by these geometric cues, our Alternating Spatio-Temporal Transformer (ASTT) captures latent point-level correspondences to simultaneously enhance spatial precision for sharp details and enforce rigorous temporal consistency. Furthermore, GemDepth employs a data-efficient training strategy, effectively bridging the gap between high efficiency and robust geometric consistency. As shown in Fig. 2, comprehensive evaluations demonstrate that GemDepth achieves state-of-the-art performance across multiple datasets, particularly in complex dynamic scenarios. The code is publicly available at: https://github.com/Yuecheng919/GemDepth.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes GemDepth for video depth estimation, introducing a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings, which guide an Alternating Spatio-Temporal Transformer (ASTT) to capture point-level correspondences for enhanced spatial detail and strict 3D temporal consistency. It claims SOTA performance across datasets, especially in complex dynamic scenes, via a data-efficient training strategy.

Significance. If the central claims hold, the explicit injection of camera-motion priors via GEM could meaningfully advance 3D-consistent video depth beyond pure temporal-smoothing Transformers, with potential benefits for dynamic-scene applications; the public code release supports reproducibility.

major comments (2)
  1. [Abstract / §3] Abstract and §3 (GEM description): the claim that predicted inter-frame poses yield reliable geometric embeddings for strict 3D consistency under rotations or drastic view changes rests on an untested rigid-scene assumption; dynamic scenes with independent object motion can cause standard pose estimators to fail, producing misaligned embeddings that undermine the ASTT correspondences. No ablation isolating GEM, no pose-supervision details, and no dynamic-object masking strategy are described.
  2. [Abstract / Evaluation] Abstract and evaluation section: SOTA is asserted via Fig. 2 with no quantitative metrics, error bars, per-dataset tables, or ablation studies provided in the text, so the strength of the performance claim (especially the dynamic-scenario advantage) cannot be verified.
minor comments (2)
  1. [§3] Notation for GEM and ASTT is introduced without a clear diagram or pseudocode; adding one would improve readability.
  2. [§4] The data-efficient training strategy is mentioned but not detailed with respect to loss terms or dataset splits.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We have carefully reviewed each major comment and provide point-by-point responses below. We agree that several clarifications and additions are needed to strengthen the manuscript and will incorporate revisions accordingly.

read point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (GEM description): the claim that predicted inter-frame poses yield reliable geometric embeddings for strict 3D consistency under rotations or drastic view changes rests on an untested rigid-scene assumption; dynamic scenes with independent object motion can cause standard pose estimators to fail, producing misaligned embeddings that undermine the ASTT correspondences. No ablation isolating GEM, no pose-supervision details, and no dynamic-object masking strategy are described.

    Authors: We acknowledge the validity of this concern. The current description in §3 does not explicitly address how GEM behaves under significant independent object motion, nor does it include an ablation isolating GEM or details on pose supervision. In the revised manuscript we will: (1) add an ablation study that isolates the GEM module and quantifies its contribution to 3D consistency; (2) provide the exact pose-supervision losses (photometric consistency and smoothness terms) and training protocol; and (3) include a brief discussion of robustness to dynamic objects, noting that the alternating spatio-temporal transformer can partially compensate for misaligned embeddings. We will also qualify the claim regarding “strict 3D consistency” to reflect the practical limitations of pose estimation in highly dynamic scenes. revision: yes
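The pose-supervision terms the rebuttal names (photometric consistency plus a smoothness term) have common self-supervised-depth forms; the following is a minimal numpy sketch under those assumptions, not the authors' actual losses (SSIM, occlusion masking, and the 0.1 weighting are omitted or invented here).

```python
import numpy as np

def photometric_loss(img, img_warped):
    """L1 photometric error between a frame and a neighbor warped into its view."""
    return np.mean(np.abs(img - img_warped))

def smoothness_loss(depth, img):
    """Edge-aware smoothness: penalize depth gradients, down-weighted where
    the image itself has strong gradients (likely genuine depth edges)."""
    d_dx = np.abs(np.diff(depth, axis=1))
    d_dy = np.abs(np.diff(depth, axis=0))
    w_dx = np.exp(-np.abs(np.diff(img, axis=1)))
    w_dy = np.exp(-np.abs(np.diff(img, axis=0)))
    return np.mean(d_dx * w_dx) + np.mean(d_dy * w_dy)

img = np.ones((8, 8))
depth = np.ones((8, 8))
# With a perfect warp and constant depth, both terms vanish.
total = photometric_loss(img, img) + 0.1 * smoothness_loss(depth, img)
```

The point of the sketch is only that pose supervision of this kind needs no ground-truth poses, which is what would make the promised ablation informative.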

  2. Referee: [Abstract / Evaluation] Abstract and evaluation section: SOTA is asserted via Fig. 2 with no quantitative metrics, error bars, per-dataset tables, or ablation studies provided in the text, so the strength of the performance claim (especially the dynamic-scenario advantage) cannot be verified.

    Authors: We apologize for the insufficient presentation of quantitative results. The full evaluation section contains per-dataset tables reporting AbsRel, SqRel, RMSE, and δ1 metrics, together with ablation studies; however, these were not sufficiently highlighted in the text or abstract. In the revision we will: (1) explicitly quote the key numerical results and per-dataset breakdowns in the main text; (2) add error bars to the reported metrics where multiple runs were performed; and (3) expand the ablation subsection to directly support the dynamic-scene advantage claim. These changes will make the performance assertions verifiable without relying solely on Fig. 2. revision: yes
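For reference, the metrics named in this response (AbsRel, SqRel, RMSE, δ1) have standard definitions in the depth-estimation literature; a minimal numpy implementation, not taken from the paper's code:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth-evaluation metrics over valid (gt > 0) pixels."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    mask = gt > 0
    pred, gt = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(pred - gt) / gt)                  # AbsRel
    sq_rel = np.mean((pred - gt) ** 2 / gt)                    # SqRel
    rmse = np.sqrt(np.mean((pred - gt) ** 2))                  # RMSE
    delta1 = np.mean(np.maximum(pred / gt, gt / pred) < 1.25)  # δ1 accuracy
    return abs_rel, sq_rel, rmse, delta1

pred = np.array([1.0, 2.0, 4.0])
gt = np.array([1.0, 2.0, 2.0])
abs_rel, sq_rel, rmse, delta1 = depth_metrics(pred, gt)
# abs_rel = 1/3, sq_rel = 2/3, rmse = sqrt(4/3), delta1 = 2/3
```

Note that relative-depth methods typically align scale (and sometimes shift) to ground truth before computing these numbers; that alignment step is omitted above.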

Circularity Check

0 steps flagged

No significant circularity; framework presented as independent architectural addition

full rationale

The paper describes GemDepth as introducing a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate embeddings, followed by an Alternating Spatio-Temporal Transformer (ASTT) for point-level correspondences. No equations, derivations, or self-citations are exhibited that reduce the claimed 3D consistency or SOTA performance to fitted parameters or prior results by construction. The approach is framed as an explicit architectural insight independent of the target outputs, with evaluations on external datasets. This is the most common honest finding for papers whose central contribution is a new module rather than a closed-form derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The abstract supplies no explicit free parameters; it does state one domain axiom and introduce two new modules (GEM, ASTT), architectural inventions whose only independent evidence is the claimed empirical improvement.

axioms (1)
  • domain assumption Explicit camera-motion awareness is a prerequisite for strict 3D geometric consistency under rotation or view change.
    Stated directly in the second paragraph of the abstract as the core insight motivating the GEM module.
invented entities (2)
  • Geometry-Embedding Module (GEM) no independent evidence
    purpose: Predict inter-frame camera poses to generate implicit geometric embeddings that equip the network with intrinsic 3D perception.
    New module introduced in the abstract; no external falsifiable handle supplied beyond the overall depth accuracy claim.
  • Alternating Spatio-Temporal Transformer (ASTT) no independent evidence
    purpose: Capture latent point-level correspondences to enhance spatial precision and enforce temporal consistency.
    New transformer variant introduced in the abstract; evidence is the claimed SOTA performance.

pith-pipeline@v0.9.0 · 5545 in / 1268 out tokens · 34083 ms · 2026-05-14T21:24:42.864887+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 6 internal anchors

  1. [1]

    MiDaS v3.1 – A Model Zoo for Robust Monocular Relative Depth Estimation

    Birkl, R., Wofk, D., and Müller, M. MiDaS v3.1 – a model zoo for robust monocular relative depth estimation. arXiv preprint arXiv:2307.14460.

  2. [2]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al. Stable video diffusion: scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a. Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S. W., Fidler, S., and Kreis, K. Align you...

  3. [3]

    arXiv preprint arXiv:2001.10773

  4. [4]

    VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    Chen, H., Xia, M., He, Y., Zhang, Y., Cun, X., Yang, S., Xing, J., Liu, Y., Chen, Q., Wang, X., et al. VideoCrafter1: open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512.

  5. [5]

    Romeo: Robust Metric Visual Odometry

    Cheng, J., Cai, Z., Zhang, Z., Yin, W., Muller, M., Paulitsch, M., and Yang, X. Romeo: robust metric visual odometry. arXiv preprint arXiv:2412.11530, 2024a. Cheng, J., Xu, G., Guo, P., and Yang, X. Coatrsnet: fully exploiting convolution and attention for stereo matching by region separation. International Journal of Computer Vision, 132(1):56–73, 2024b. ...

  6. [6]

    Deep Ordinal Regression Network for Monocular Depth Estimation

    Fu, H., Gong, M., Wang, C., Batmanghelich, K., and Tao, D. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2002–2011.

  7. [7]

    DepthCrafter: Generating Consistent Long Depth Sequences for Open-World Videos

    Hu, W., Gao, X., Li, X., Zhao, S., Cun, X., Zhang, Y., Quan, L., and Shan, Y. DepthCrafter: generating consistent long depth sequences for open-world videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 2005–2015.

  8. [8]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Keetha, N., Müller, N., Schönberger, J., Porzi, L., Zhang, Y., Fischer, T., Knapitsch, A., Zauss, D., Weber, E., Antunes, N., et al. MapAnything: universal feed-forward metric 3D reconstruction. arXiv preprint arXiv:2509.13414.

  9. [9]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Lin, H., Chen, S., Liew, J. H., Chen, D. Y., Li, Z., Shi, G., Feng, J., and Kang, B. Depth Anything 3: recovering the visual space from any views. arXiv preprint arXiv:2511.10647.

  10. [10]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al. DINOv2: learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.

  11. [11]

    StableDPT: Temporal Stable Monocular Video Depth Estimation

    Sobko, I., Riemenschneider, H., Gross, M., and Schroers, C. StableDPT: temporal stable monocular video depth estimation. arXiv preprint arXiv:2601.02793.

  12. [12]

    IRS: A Large Naturalistic Indoor Robotics Stereo Dataset to Train Deep Models for Disparity and Surface Normal Estimation

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., and Novotny, D. VGGT: visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 5294–5306, 2025a. Wang, Q., Zheng, S., Yan, Q., Deng, F., Zhao, K., and Chu, X. IRS: a large naturalistic indoor robotics stereo dataset to train deep models fo...

  13. [13]

    StereoGen: High-Quality Stereo Image Generation from a Single Image

    Wang, X., Yang, H., Xu, G., Cheng, J., Lin, M., Deng, Y., Zang, J., Chen, Y., and Yang, X. StereoGen: high-quality stereo image generation from a single image. arXiv e-prints, pp. arXiv–2501, 2025b. Wang, X., Yang, H., Xu, G., Cheng, J., Lin, M., Deng, Y., Zang, J., Chen, Y., and Yang, X. ZeroStereo: zero-shot stereo matching from single images. In Pro...

  14. [14]

    Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers

    Xu, G., Lin, H., Luo, H., Wang, X., Yao, J., Zhu, L., Pu, Y., Chi, C., Sun, H., Wang, B., et al. Pixel-perfect depth with semantics-prompted diffusion transformers. arXiv preprint arXiv:2510.07316, 2025a. Xu, G., Liu, J., Wang, X., Cheng, J., Deng, Y., Zang, J., Chen, Y., and Yang, X. Banet: bilateral aggregation network for mobile stereo matching. In Pr...

  15. [15]

    Table 8. Summary of datasets used for training

    This approach enables the network to generalize across varying aspect ratios and resolutions. Table 8. Summary of datasets used for training (Dataset / Indoor / Outdoor / # Images). Pose-annotated datasets: Virtual KITTI 2 (Cabon et al., 2020) ✓ 40K; Tartanair (Wang et al., 2020) ✓ ✓ 300K; PointOdyssey (Zheng et al., 2023) ✓ ✓ 70K; MVS-Synth (Huang et al., 2018) ✓ 80K; Dynamic...

  16. [16]

    dataset as a benchmark. Specifically, we conduct a comparative analysis between two versions of our GemDepth and the current state-of-the-art method, VideoDepthAnything (VDA) (Chen et al., 2025), under five different sequence lengths: 500, 400, 300, 200, and 100 frames. As demonstrated in Tab. 10, both versions of GemDepth consistently outperform VDA acro...