pith. machine review for the scientific record.

arxiv: 2603.24577 · v2 · submitted 2026-03-25 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 23:59 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords depth estimation · 3D reconstruction · surgical robotics · graph attention network · deformable tissues · zero-shot generalization · DeGAT · EndoVGGT

The pith

Dynamic feature-space graphs in DeGAT recover consistent depth maps for occluded surgical tissues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EndoVGGT to estimate depth for 3D reconstruction of deformable soft tissues during surgery. Fixed-topology methods break down on low-texture surfaces, specular highlights, and instrument occlusions, so the authors replace static spatial neighborhoods with a Deformation-aware Graph Attention module. DeGAT builds graphs dynamically in feature space to link distant but coherent tissue regions. This propagates structural cues across gaps and enforces global consistency for non-rigid motion. Tests on SCARED report large gains in PSNR and SSIM plus zero-shot transfer to EndoNeRF, showing the graphs learn domain-independent geometric rules.

Core claim

EndoVGGT equips a geometry-centric framework with a Deformation-aware Graph Attention (DeGAT) module that dynamically constructs feature-space semantic graphs. These graphs capture long-range correlations among coherent tissue regions, enabling robust propagation of structural cues across occlusions, enforcing global consistency, and improving non-rigid deformation recovery.

What carries the argument

The Deformation-aware Graph Attention (DeGAT) module, which dynamically constructs feature-space semantic graphs to capture long-range correlations among tissue regions instead of using static spatial neighborhoods.
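As a concrete illustration, the mechanism described above can be sketched as a k-nearest-neighbor graph built over feature vectors rather than pixel coordinates, with attention-weighted aggregation over each dynamic neighborhood. This is a minimal sketch under assumed shapes and a distance-based attention score, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_graph_attention(feats, k=4):
    """feats: (N, C) patch-token features. Returns refined (N, C) features.

    Neighbors are chosen by feature-space distance (not spatial position),
    so coherent but spatially distant regions can be linked.
    """
    # Pairwise Euclidean distances in feature space.
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude self from neighbor search
    nbrs = np.argsort(d, axis=1)[:, :k]    # k nearest neighbors per node
    # Attention over each node's dynamic neighborhood; negative distance
    # as the logit, so closer-in-feature-space neighbors get higher weight.
    logits = -np.take_along_axis(d, nbrs, axis=1)
    alpha = softmax(logits, axis=1)        # (N, k), rows sum to 1
    refined = (alpha[:, :, None] * feats[nbrs]).sum(axis=1)
    return refined, nbrs

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))          # 16 illustrative patch tokens
out, nbrs = dynamic_graph_attention(tokens, k=4)
```

Swapping the feature-space distances for spatial coordinates would recover a static neighborhood; the point of the dynamic variant is that occluded-but-coherent regions stay connected in the graph.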

If this is right

  • Higher PSNR and SSIM scores on the SCARED benchmark for depth estimation.
  • Improved recovery of non-rigid deformations in soft tissue surfaces.
  • Strong zero-shot generalization to unseen surgical domains such as EndoNeRF.
  • Enforced global consistency in reconstructed 3D models despite instrument occlusions and specular highlights.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dynamic graph mechanism could be tested on other deformable-object reconstruction tasks such as endoscopic inspection of industrial pipes.
  • Replacing fixed neighborhoods with learned feature graphs may reduce the volume of labeled surgical data needed for training future models.
  • Real-time implementation on robotic platforms would need to measure whether the graph construction adds latency that affects closed-loop control.

Load-bearing premise

Dynamically constructing feature-space semantic graphs will reliably capture long-range tissue correlations and enforce global consistency without introducing new artifacts or requiring dataset-specific tuning.

What would settle it

A new surgical video dataset with heavy occlusions and low texture where EndoVGGT shows no PSNR or SSIM gain and produces visible depth artifacts compared with the prior state-of-the-art baseline.

Figures

Figures reproduced from arXiv: 2603.24577 by Arnis Lektauers, Bo Liu, Falong Fan, Jerzy Rozenblit, Yi Xie.

Figure 1. Visualization of DeGAT neighbor aggregation. (a–b) Neighborhood construction and feature responses in the proposed DeGAT module; ⋆ indicates the centroid and ◦ indicates its neighbors. The highlighted ⋆ aggregates informative context even across instrument boundaries, enabling robust feature refinement. (c–d) Depth estimation comparison without (c) and with (d) DeGAT. Incorporating DeGAT y…
Figure 2. Overview of the EndoVGGT framework. The proposed DeGAT module enhances the features extracted from DINOv2 [16], and camera tokens interact via both global and within-frame attention mechanisms. Depth maps are predicted by a DPT head [19]; camera poses are predicted by an MLP to reconstruct the input scene and are constrained by a composite loss introduced in Sec. 3.2. Problem setting and geomet…
Figure 3. Experiment results on the EndoNeRF and SCARED datasets. "Average" denotes the mean performance across all evaluated subsets.
Figure 4. Visualization of DeGAT at different levels. Red boxes highlight complex instrument–tissue boundaries. Feature-level DeGAT (d) preserves sharper continuity.
Figure 5. Ablation study on the number of neighbors.
read the original abstract

Accurate 3D reconstruction of deformable soft tissues is essential for surgical robotic perception. However, low-texture surfaces, specular highlights, and instrument occlusions often fragment geometric continuity, posing a challenge for existing fixed-topology approaches. To address this, we propose EndoVGGT, a geometry-centric framework equipped with a Deformation-aware Graph Attention (DeGAT) module. Rather than using static spatial neighborhoods, DeGAT dynamically constructs feature-space semantic graphs to capture long-range correlations among coherent tissue regions. This enables robust propagation of structural cues across occlusions, enforcing global consistency and improving non-rigid deformation recovery. Extensive experiments on SCARED show that our method significantly improves fidelity, increasing PSNR by 24.6% and SSIM by 9.1% over prior state-of-the-art. Crucially, EndoVGGT exhibits strong zero-shot cross-dataset generalization to the unseen SCARED and EndoNeRF domains, confirming that DeGAT learns domain-agnostic geometric priors. These results highlight the efficacy of dynamic feature-space modeling for consistent surgical 3D reconstruction.
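The abstract's headline numbers are relative gains in PSNR and SSIM. For reference, PSNR is a simple function of mean squared error; a minimal computation of the standard definition (not code from the paper):

```python
import numpy as np

def psnr(ref, test, max_val=1.0):
    """Peak Signal-to-Noise Ratio between two images with values in [0, max_val]."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.zeros((4, 4))
noisy = ref + 0.1             # uniform error of 0.1 -> MSE = 0.01
val = psnr(ref, noisy)        # 10 * log10(1 / 0.01) = 20 dB
```

Because PSNR is logarithmic, a 24.6% relative gain (as claimed on SCARED) corresponds to a large multiplicative reduction in MSE, which is why the referee asks for the protocol behind the number.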

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes EndoVGGT, a geometry-centric framework for depth estimation and 3D reconstruction of deformable soft tissues in surgery. It introduces the Deformation-aware Graph Attention (DeGAT) module, which dynamically constructs feature-space semantic graphs (instead of static spatial neighborhoods) to capture long-range correlations among coherent tissue regions, aiming to handle low-texture surfaces, specular highlights, and instrument occlusions for improved global consistency and non-rigid deformation recovery. Experiments are claimed to show 24.6% PSNR and 9.1% SSIM gains over prior SOTA on SCARED, plus strong zero-shot cross-dataset generalization to EndoNeRF confirming domain-agnostic priors.

Significance. If the performance gains and generalization hold under rigorous validation, the work could meaningfully advance surgical robotic perception by providing more robust 3D reconstructions in challenging real-world conditions. The shift to dynamic feature-space graphs addresses a recognized limitation of fixed-topology methods, and the reported cross-domain results suggest practical utility. The paper does not ship machine-checked proofs or parameter-free derivations, but the empirical focus on generalization is a strength worth verifying.

major comments (2)
  1. [Abstract] Abstract: The central claims of 24.6% PSNR and 9.1% SSIM improvement on SCARED, plus zero-shot generalization, are presented without any experimental protocol, baseline details, error bars, number of runs, or ablation results, leaving the performance claims unverifiable and load-bearing for the paper's contribution.
  2. [Method] Method (DeGAT description): No ablation isolates the dynamic graph construction from the rest of the pipeline, nor are graph statistics (e.g., average degree, attention entropy) or stability analysis under specular highlights/occlusions provided; this directly bears on whether DeGAT reliably enforces global consistency without artifacts or dataset-specific tuning.
minor comments (1)
  1. [Abstract] Abstract: The phrasing 'zero-shot cross-dataset generalization to the unseen SCARED and EndoNeRF domains' appears inconsistent with the statement that experiments are on SCARED; clarify the train/test split and which dataset is truly unseen.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below with point-by-point responses and have revised the manuscript where appropriate to strengthen the presentation of results and ablations.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claims of 24.6% PSNR and 9.1% SSIM improvement on SCARED, plus zero-shot generalization, are presented without any experimental protocol, baseline details, error bars, number of runs, or ablation results, leaving the performance claims unverifiable and load-bearing for the paper's contribution.

    Authors: The experimental protocol, baselines, error bars (computed over 5 runs), number of runs, and ablation results are fully detailed in Sections 4.1–4.3 and the supplementary material. The abstract serves as a concise summary of the key outcomes. To improve verifiability at the abstract level, we have revised it to include a brief reference to the SCARED evaluation protocol and cross-dataset zero-shot testing on EndoNeRF. We believe this balances brevity with transparency without expanding the abstract beyond standard length limits. revision: partial

  2. Referee: [Method] Method (DeGAT description): No ablation isolates the dynamic graph construction from the rest of the pipeline, nor are graph statistics (e.g., average degree, attention entropy) or stability analysis under specular highlights/occlusions provided; this directly bears on whether DeGAT reliably enforces global consistency without artifacts or dataset-specific tuning.

    Authors: We agree that isolating the dynamic graph construction is essential to substantiate its contribution. In the revised manuscript, we have added a dedicated ablation study in Section 4.2 that directly compares the full DeGAT module against a static spatial neighborhood variant, confirming the benefit of feature-space dynamic graphs. We now report graph statistics (average degree and attention entropy) in Section 3.2 and include a stability analysis with both quantitative metrics and qualitative examples under specular highlights and occlusions in the supplementary material, showing consistent global coherence without introduced artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical results on external datasets are independent of any derivation

full rationale

The paper introduces EndoVGGT with a DeGAT module that dynamically builds feature-space graphs for depth estimation. All central claims (PSNR/SSIM gains on SCARED, zero-shot generalization to EndoNeRF) are presented as outcomes of experimental evaluation on held-out datasets rather than quantities derived from equations or parameters fitted to the target metrics. No mathematical derivations, self-definitional relations, or fitted-input predictions appear in the provided text. Any self-citations are incidental and not load-bearing for the reported numbers, which remain externally falsifiable by re-running the model on the same public benchmarks. The chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unproven efficacy of the newly introduced DeGAT module for capturing long-range correlations; no free parameters or external axioms are explicitly listed in the abstract.

axioms (1)
  • domain assumption Dynamic feature-space graphs can propagate structural cues across occlusions in surgical scenes.
    Invoked to justify the DeGAT design and its claimed robustness.
invented entities (1)
  • Deformation-aware Graph Attention (DeGAT) module no independent evidence
    purpose: Dynamically construct feature-space semantic graphs to capture long-range tissue correlations.
    New component introduced by the paper to address limitations of fixed-topology methods.

pith-pipeline@v0.9.0 · 5501 in / 1313 out tokens · 32566 ms · 2026-05-14T23:59:05.529063+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 9 canonical work pages · 4 internal anchors

  1. [1]

    arXiv preprint arXiv:2101.01133 (2021)

    Allan, M., Mcleod, J., Wang, C., Rosenthal, J.C., Hu, Z., Gard, N., Eisert, P., Fu, K.X., Zeffiro, T., Xia, W., et al.: Stereo correspondence and reconstruction of endoscopic data challenge. arXiv preprint arXiv:2101.01133 (2021)

  2. [2]

JAMA Otolaryngology–Head & Neck Surgery 150(4), 318–326 (2024)

Bartholomew, R.A., Zhou, H., Boreel, M., Suresh, K., Gupta, S., Mitchell, M.B., Hong, C., Lee, S.E., Smith, T.R., Guenette, J.P., et al.: Surgical navigation in the anterior skull base using 3-dimensional endoscopy and surface reconstruction. JAMA Otolaryngology–Head & Neck Surgery 150(4), 318–326 (2024)

  3. [3]

    ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

    Bhat, S.F., Birkl, R., Wofk, D., Wonka, P., Müller, M.: Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288 (2023)

  4. [4]

    Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289 (2015)

  5. [5]

International Journal of Computer Assisted Radiology and Surgery 19(6), 1013–1020 (2024)

Cui, B., Islam, M., Bai, L., Ren, H.: Surgical-DINO: adapter learning of foundation models for depth estimation in endoscopic surgery. International Journal of Computer Assisted Radiology and Surgery 19(6), 1013–1020 (2024)

  6. [6]

International journal of computer assisted radiology and surgery 14(7), 1217–1225 (2019)

Funke, I., Mees, S.T., Weitz, J., Speidel, S.: Video-based surgical skill assessment using 3d convolutional neural networks. International journal of computer assisted radiology and surgery 14(7), 1217–1225 (2019)

  7. [7]

Journal of imaging 11(2), 44 (2025)

Göbel, B., Huurdeman, J., Reiterer, A., Möller, K.: Robot-based procedure for 3d reconstruction of abdominal organs using the iterative closest point and pose graph algorithms. Journal of imaging 11(2), 44 (2025)

  8. [8]

Advances in neural information processing systems 35, 8291–8303 (2022)

Han, K., Wang, Y., Guo, J., Tang, Y., Wu, E.: Vision GNN: An image is worth graph of nodes. Advances in neural information processing systems 35, 8291–8303 (2022)

  9. [9]

    He, Z., Wang, T.: Openlrm: Open-source large reconstruction models (2023)

  10. [10]

Scientific Reports 13(1), 15380 (2023)

Hirohata, Y., Sogabe, M., Miyazaki, T., Kawase, T., Kawashima, K.: Confidence-aware self-supervised learning for dense monocular depth estimation in dynamic laparoscopic scene. Scientific Reports 13(1), 15380 (2023)

  11. [11]

ACM Trans. Graph. 42(4), Article 139 (2023)

Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42(4), Article 139 (2023)

  12. [12]

    arXiv preprint arXiv:2401.12561 (2024)

    Liu, Y., Li, C., Yang, C., Yuan, Y.: Endogaussian: Real-time gaussian splatting for dynamic endoscopic scene reconstruction. arXiv preprint arXiv:2401.12561 (2024)

  13. [13]

    arXiv preprint arXiv:2302.13219 (2023)

    Lu, Y., Wei, R., Li, B., Chen, W., Zhou, J., Dou, Q., Sun, D., Liu, Y.h.: Autonomous intelligent navigation for flexible endoscopy using monocular depth guidance and 3-d shape planning. arXiv preprint arXiv:2302.13219 (2023)

  14. [14]

Communications of the ACM 65(1), 99–106 (2021)

Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99–106 (2021)

  15. [15]

In: 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society

Mountney, P., Yang, G.Z.: Dynamic view expansion for minimally invasive surgery using simultaneous localization and mapping. In: 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society. pp. 1184–1187. IEEE (2009)

  16. [16]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  17. [17]

Medical image analysis 71, 102058 (2021)

Ozyoruk, K.B., Gokceler, G.I., Bobrow, T.L., Coskun, G., Incetan, K., Almalioglu, Y., Mahmood, F., Curto, E., Perdigoto, L., Oliveira, M., et al.: Endoslam dataset and an unsupervised monocular visual odometry and depth estimation approach for endoscopic videos. Medical image analysis 71, 102058 (2021)

  18. [18]

    In: Proceedings of the AAAI conference on artificial intelligence

    Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A.: Film: Visual reasoning with a general conditioning layer. In: Proceedings of the AAAI conference on artificial intelligence. vol. 32 (2018)

  19. [19]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 12179–12188 (2021)

  20. [20]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4104–4113 (2016)

  21. [21]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Sun, C., Sun, M., Chen, H.T.: Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5459–5469 (2022)

  22. [22]

    Graph Attention Networks

    Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)

  23. [23]

    In: Proceedings of the Symposium on Modeling and Simulation in Medicine

Wagner, A., Rozenblit, J.W.: Augmented reality visual guidance for spatial perception in the computer assisted surgical trainer. In: Proceedings of the Symposium on Modeling and Simulation in Medicine. pp. 1–12 (2017)

  24. [24]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5294–5306 (2025)

  25. [25]

    arXiv preprint arXiv:2503.22437 (2025)

    Wang, X., Zhang, S., Huang, B., Stoyanov, D., Mazomenos, E.B.: Endolrmgs: Complete endoscopic scene reconstruction combining large reconstruction modelling and gaussian splatting. arXiv preprint arXiv:2503.22437 (2025)

  26. [26]

    In: International conference on medical image computing and computer-assisted intervention

Wang, Y., Long, Y., Fan, S.H., Dou, Q.: Neural rendering for stereo 3D reconstruction of deformable tissues in robotic surgery. In: International conference on medical image computing and computer-assisted intervention. pp. 431–441. Springer (2022)

  27. [27]

    Artificial Intelligence Surgery4(3), 187–198 (2024)

    Wei, R., Guo, J., Lu, Y., Zhong, F., Liu, Y., Sun, D., Dou, Q.: Scale-aware monocular reconstruction via robot kinematics and visual data in neural radiance fields. Artificial Intelligence Surgery4(3), 187–198 (2024)

  28. [28]

    In: International Conference on Medical Image Computing and Computer-Assisted Intervention

    Xu, M., Guo, Z., Wang, A., Bai, L., Ren, H.: A review of 3d reconstruction techniques for deformable tissues in robotic surgery. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 157–167. Springer (2024)

  29. [29]

    In: Proceedings of the European conference on computer vision (ECCV)

    Yao, Y., Luo, Z., Li, S., Fang, T., Quan, L.: Mvsnet: Depth inference for unstructured multi-view stereo. In: Proceedings of the European conference on computer vision (ECCV). pp. 767–783 (2018)

  30. [30]

    In: International conference on medical image computing and computer-assisted intervention

Zha, R., Cheng, X., Li, H., Harandi, M., Ge, Z.: EndoSurf: Neural surface reconstruction of deformable tissues with stereo endoscope videos. In: International conference on medical image computing and computer-assisted intervention. pp. 13–23. Springer (2023)

  31. [31]

7 Theoretical Properties of DeGAT. This appendix formalizes several properties of DeGAT that support its use as a geometry-consistent feature refinement operator under deformation and occlusion. 7.1 Row-stochastic aggregation and convexity. Lemma 1 (Row-stochasticity of DeGAT attention). For each node i, the attention coefficients {α_ij}_j…

  32. [32]

7.3 Permutation equivariance (token indexing should not matter). Proposition 2 (Permutation equivariance of one-hop DeGAT). Let π be a permutation of token indices and P the corresponding permutation matrix. If token features and coordinates are permuted consistently, X′ = PX and p′_π(i) = p_i, then the DeGAT output permutes in th…

  33. [33]

For r² > 0, substituting Ĉ⋆ = α/(γr²) into J yields min_{Ĉ>0} J(Ĉ) = α − α log(α/(γr²)) = α log(γr²) + const, where "const" is independent of the model outputs

Corollary 3 (Equivalent marginal penalty after eliminating confidence). For r² > 0, substituting Ĉ⋆ = α/(γr²) into J yields min_{Ĉ>0} J(Ĉ) = α − α log(α/(γr²)) = α log(γr²) + const, where "const" is independent of the model outputs. Proof. Direct substitution gives J(Ĉ⋆) = γr² · α/(γr²) − α log(α/(γr²)). Rearranging yields the stated fo…

  34. [34]

Update attention parameters {a, W_proj}: let δ_ij = ∇ᵀ_{x_agg,i} v_j be the gradient w.r.t…

Backward Propagation. Given the gradient from the task loss L, denoted ∇_{x_agg,i} = ∂L/∂x_agg,i, the gradients for the learnable parameters are computed via the chain rule. Update value matrix W_val: the gradient flows through the weighted sum and the linear projection: ∇W_val ← Σ_{b,i} Σ_{j∈N(i)} ∇_{x_agg,i} · α_ij ⊗ x_j. Update attention parameters {a, W_proj}: let δ_ij …

  35. [35]

11.2 Attention-Level DeGAT Implementation Details. MLP bias reference: Raffel et al., Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5). Input: semantic features F ∈ ℝ^{B×N×C}, patch tokens extracted from a frozen DINOv2 backbone; attention components: query Q and key K ∈ ℝ^{B×H×N×d_head} from the VGGT fram…

  36. [36]

The pairwise Euclidean distance matrix D ∈ ℝ^{B×N×N} is computed as d_ij = ‖f_i − f_j‖₂ = √(Σ_{c=1}^{C} (f_{i,c} − f_{j,c})²)

Semantic Distance Calculation. We construct a fully connected semantic graph in which each node corresponds to a patch token, and the edge weights represent semantic dissimilarity. The pairwise Euclidean distance matrix D ∈ ℝ^{B×N×N} is computed as d_ij = ‖f_i − f_j‖₂ = √(Σ_{c=1}^{C} (f_{i,c} − f_{j,c})²). Smaller values of d_ij indicate higher semantic similarity, e.g., …

  37. [37]

Logarithmic Transformation. To reduce the influence of outliers while increasing sensitivity to highly similar patches, a logarithmic transformation is applied: d̃_ij = log(d_ij + 1). 3. Linear Mapping and Quantization. The transformed distances are normalized relative to the maximum semantic distance within the current view to ensure scale invariance: ra…

  38. [38]

    This mechanism allows the model to learn distinct attention bonuses or penalties for different levels of semantic similarity

Bias Lookup. For each attention head h, a learnable scalar bias is retrieved from the embedding table: b^(h)_ij = b[Idx_ij]_h. This mechanism allows the model to learn distinct attention bonuses or penalties for different levels of semantic similarity

  39. [39]

Injection into Attention Mechanism. The semantic bias is added directly to the attention logits in the VGGT frame attention blocks: Attention^(h)_ij = Softmax(q_i · k_jᵀ / √d_head + b^(h)_ij)

  40. [40]

The gradient of the loss with respect to the bias term is denoted δ^(h)_ij = ∂L/∂b^(h)_ij

Backward Propagation. Let L be the total loss. The gradient of the loss with respect to the bias term is denoted δ^(h)_ij = ∂L/∂b^(h)_ij. Since the bias is added directly to the logits, this gradient is derived from the standard Softmax backward pass. The learnable embedding table b is updated by aggregating gradients from all patch pairs within the same …

  41. [41]

Given a batch index b ∈ {1, …, B}, the continuous semantic bias is computed as follows

Procedure. Given a batch index b ∈ {1, …, B}, the continuous semantic bias is computed as follows

  42. [42]

The pairwise Euclidean distance matrix D ∈ ℝ^{B×N×N} is computed between all patch tokens: d_ij = ‖f_i − f_j‖₂

Semantic Euclidean Distance. We abandon spatial coordinates in favor of feature-space representations. The pairwise Euclidean distance matrix D ∈ ℝ^{B×N×N} is computed between all patch tokens: d_ij = ‖f_i − f_j‖₂. This distance measures the raw semantic discrepancy between two patch tokens

  43. [43]

First, a logarithmic transformation is applied: d̂_ij = log(d_ij + 1)

Log-Space Normalization. To handle the heavy-tailed distribution of feature distances while preserving high resolution for semantically similar tokens, the distances are transformed and normalized in log space. First, a logarithmic transformation is applied: d̂_ij = log(d_ij + 1). The transformed distances are then normalized using the maximum distance wi…

  44. [44]

Here, W₁ ∈ ℝ^{M×1}, W₂ ∈ ℝ^{H×M}, and b₁, b₂ are learnable bias terms

Continuous Bias Generation. Instead of a discrete embedding lookup, a lightweight MLP Ψ is employed to map the continuous distance coordinate to a head-specific attention bias: B_ij = Ψ(x_ij) = W₂ ReLU(W₁ x_ij + b₁) + b₂. Here, W₁ ∈ ℝ^{M×1}, W₂ ∈ ℝ^{H×M}, and b₁, b₂ are learnable bias terms. The MLP enables approximation of arbitrary continuous functions, thereby …

  45. [45]

Injection into Attention Mechanism. The generated continuous bias is added directly to the attention logits of the self-attention operation: Attention^(h)_ij = Softmax(q_i · k_jᵀ / √d_head + B^(h)_ij)

  46. [46]

Let δ^(h)_ij = ∂L/∂B^(h)_ij be the gradient of the loss with respect to the generated bias

Backward Propagation. Unlike quantization-based methods, the MLP projection enables end-to-end gradient flow. Let δ^(h)_ij = ∂L/∂B^(h)_ij be the gradient of the loss with respect to the generated bias. Update MLP parameters Θ_Ψ: standard backpropagation is applied to update the MLP weights. Let z_ij = ReLU(W₁ x_ij + b₁) be the hidden activation…

  47. [47]

"cutting", which depicts tissue excision with topological changes, and "pulling"

11.5 Implementation Details of FiLM-based Camera Token Modulation. We adopt a Feature-wise Linear Modulation (FiLM) mechanism [18] to condition the camera token on image content. This design allows the camera token to adapt dynamically to the input frames while preserving training stability, as indicated by the cls + FiLM modulation in …

  48. [48]

This subsection provides definitions of the evaluation metrics; readers familiar with them may skip it

11.8 Formulas for PSNR, SSIM, and LPIPS. To quantitatively evaluate the reconstruction quality, we employ three standard metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS). This subsection provides definitions of the evaluation metrics; readers …

  49. [49]

For PSNR and SSIM, the higher ↑ the better

Table 3: Experiment results on the EndoNeRF and SCARED datasets using PSNR, SSIM, and LPIPS metrics. For PSNR and SSIM, higher ↑ is better; for LPIPS, lower ↓ is better. The best results are highlighted in green, and the second-best results are underlined. Dataset Method PSNR↑ SSIM↑ LPIPS↓ EndoNeRF-pulling VGGT 23.349 0.659 0.396 …
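Entries 36–45 above, read together, describe a single pipeline: pairwise feature distances, log-space normalization, a small MLP mapping each normalized distance to a per-head bias, and injection of that bias into the attention logits. A minimal sketch under assumed shapes and random initialization (illustrative only, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, H, d_head, M = 6, 8, 2, 4, 16   # tokens, channels, heads, head dim, MLP width

F = rng.normal(size=(N, C))                      # patch-token features
# Semantic Euclidean distance matrix: d_ij = ||f_i - f_j||_2
D = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=-1)
# Log transform and per-view max normalization (scale invariance)
Dlog = np.log(D + 1.0)
x = Dlog / Dlog.max()                            # normalized to [0, 1]

# Continuous bias MLP: B_ij = W2 @ relu(W1 * x_ij + b1) + b2, one bias per head
W1 = rng.normal(size=(M, 1)); b1 = np.zeros(M)
W2 = rng.normal(size=(H, M)); b2 = np.zeros(H)
hidden = np.maximum(W1[None, None, :, 0] * x[:, :, None] + b1, 0.0)  # (N, N, M)
B = hidden @ W2.T + b2                            # (N, N, H)

# Inject into attention logits: softmax(q k^T / sqrt(d_head) + B)
Q = rng.normal(size=(H, N, d_head))
K = rng.normal(size=(H, N, d_head))
logits = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head) + B.transpose(2, 0, 1)
attn = np.exp(logits - logits.max(-1, keepdims=True))
attn = attn / attn.sum(-1, keepdims=True)         # (H, N, N), rows sum to 1
```

Because the bias comes from a continuous MLP rather than a quantized lookup table, gradients flow end to end from the attention logits back to W₁, W₂, matching the backward-propagation fragments in entries 40 and 46.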