EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-14 23:59 UTC · model grok-4.3
The pith
Dynamic feature-space graphs in DeGAT recover consistent depth maps for occluded surgical tissues.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EndoVGGT equips a geometry-centric framework with a Deformation-aware Graph Attention (DeGAT) module that dynamically constructs feature-space semantic graphs. These graphs capture long-range correlations among coherent tissue regions, enabling robust propagation of structural cues across occlusions, enforcing global consistency, and improving non-rigid deformation recovery.
What carries the argument
The Deformation-aware Graph Attention (DeGAT) module, which dynamically constructs feature-space semantic graphs to capture long-range correlations among tissue regions instead of using static spatial neighborhoods.
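As a minimal sketch of the mechanism described above (the names, shapes, and the simplified dot-product logits are illustrative assumptions, not the paper's actual DeGAT implementation), replacing static spatial neighborhoods with feature-space neighbors looks like:

```python
import numpy as np

def dynamic_feature_graph(x, k=4):
    """Build a k-NN graph in feature space (illustrative sketch).

    x: (N, C) array of patch-token features.
    Returns an (N, k) array of neighbor indices per node, chosen by
    Euclidean feature distance rather than spatial adjacency, so the
    graph topology changes with the input.
    """
    # Pairwise squared Euclidean distances between token features.
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude self-edges
    return np.argsort(d2, axis=1)[:, :k]  # k nearest in feature space

def graph_attention(x, nbrs, w):
    """One GAT-style aggregation step over the dynamic graph: softmax
    over simplified neighbor logits, then a convex combination of
    projected neighbor features."""
    h = x @ w                              # shared linear projection
    out = np.empty_like(h)
    for i, js in enumerate(nbrs):
        logits = h[js] @ h[i]              # simplified attention logits
        a = np.exp(logits - logits.max())
        a /= a.sum()                       # row-stochastic coefficients
        out[i] = a @ h[js]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
nbrs = dynamic_feature_graph(x, k=4)
y = graph_attention(x, nbrs, np.eye(8))
```

Because the neighbor sets are recomputed per input, two tokens far apart in the image but close in feature space (e.g., the same tissue on either side of an instrument) can be directly connected.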
If this is right
- Higher PSNR and SSIM scores on the SCARED benchmark for depth estimation.
- Improved recovery of non-rigid deformations in soft tissue surfaces.
- Strong zero-shot generalization to unseen surgical domains such as EndoNeRF.
- Enforced global consistency in reconstructed 3D models despite instrument occlusions and specular highlights.
Where Pith is reading between the lines
- The same dynamic graph mechanism could be tested on other deformable-object reconstruction tasks such as endoscopic inspection of industrial pipes.
- Replacing fixed neighborhoods with learned feature graphs may reduce the volume of labeled surgical data needed for training future models.
- Real-time implementation on robotic platforms would need to measure whether the graph construction adds latency that affects closed-loop control.
Load-bearing premise
Dynamically constructing feature-space semantic graphs will reliably capture long-range tissue correlations and enforce global consistency without introducing new artifacts or requiring dataset-specific tuning.
What would settle it
A new surgical video dataset with heavy occlusions and low texture where EndoVGGT shows no PSNR or SSIM gain and produces visible depth artifacts compared with the prior state-of-the-art baseline.
Figures
Original abstract
Accurate 3D reconstruction of deformable soft tissues is essential for surgical robotic perception. However, low-texture surfaces, specular highlights, and instrument occlusions often fragment geometric continuity, posing a challenge for existing fixed-topology approaches. To address this, we propose EndoVGGT, a geometry-centric framework equipped with a Deformation-aware Graph Attention (DeGAT) module. Rather than using static spatial neighborhoods, DeGAT dynamically constructs feature-space semantic graphs to capture long-range correlations among coherent tissue regions. This enables robust propagation of structural cues across occlusions, enforcing global consistency and improving non-rigid deformation recovery. Extensive experiments on SCARED show that our method significantly improves fidelity, increasing PSNR by 24.6% and SSIM by 9.1% over prior state-of-the-art. Crucially, EndoVGGT exhibits strong zero-shot cross-dataset generalization to the unseen SCARED and EndoNeRF domains, confirming that DeGAT learns domain-agnostic geometric priors. These results highlight the efficacy of dynamic feature-space modeling for consistent surgical 3D reconstruction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes EndoVGGT, a geometry-centric framework for depth estimation and 3D reconstruction of deformable soft tissues in surgery. It introduces the Deformation-aware Graph Attention (DeGAT) module, which dynamically constructs feature-space semantic graphs (instead of static spatial neighborhoods) to capture long-range correlations among coherent tissue regions, aiming to handle low-texture surfaces, specular highlights, and instrument occlusions for improved global consistency and non-rigid deformation recovery. Experiments are claimed to show 24.6% PSNR and 9.1% SSIM gains over prior SOTA on SCARED, plus strong zero-shot cross-dataset generalization to EndoNeRF confirming domain-agnostic priors.
Significance. If the performance gains and generalization hold under rigorous validation, the work could meaningfully advance surgical robotic perception by providing more robust 3D reconstructions in challenging real-world conditions. The shift to dynamic feature-space graphs addresses a recognized limitation of fixed-topology methods, and the reported cross-domain results suggest practical utility. The paper does not ship machine-checked proofs or parameter-free derivations, but its empirical focus on generalization is a strength worth verifying.
major comments (2)
- [Abstract] The central claims of 24.6% PSNR and 9.1% SSIM improvement on SCARED, plus zero-shot generalization, are presented without any experimental protocol, baseline details, error bars, number of runs, or ablation results, leaving the performance claims unverifiable and load-bearing for the paper's contribution.
- [Method] (DeGAT description) No ablation isolates the dynamic graph construction from the rest of the pipeline, and no graph statistics (e.g., average degree, attention entropy) or stability analysis under specular highlights and occlusions is provided; this bears directly on whether DeGAT reliably enforces global consistency without artifacts or dataset-specific tuning.
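The graph statistics the referee requests are inexpensive to compute. An illustrative sketch (not the authors' code), assuming a row-stochastic attention matrix:

```python
import numpy as np

def attention_stats(attn, eps=1e-12, thresh=0.01):
    """Diagnostics for a row-stochastic attention matrix attn of shape
    (N, N): average effective degree (edges whose weight exceeds
    `thresh`) and mean per-row attention entropy. The threshold is an
    arbitrary illustrative choice."""
    assert np.allclose(attn.sum(axis=1), 1.0)
    avg_degree = (attn > thresh).sum(axis=1).mean()
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=1)
    return avg_degree, row_entropy.mean()

# Uniform attention over N=8 nodes: entropy should equal log(8) and
# every node attends to all 8 neighbors above the threshold.
n = 8
uniform = np.full((n, n), 1.0 / n)
deg, ent = attention_stats(uniform)
```

Low entropy indicates peaked, nearly static neighborhoods; entropy near log(N) indicates diffuse attention, so tracking both across frames would show whether the dynamic graphs actually change with deformation.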
minor comments (1)
- [Abstract] The phrasing 'zero-shot cross-dataset generalization to the unseen SCARED and EndoNeRF domains' appears inconsistent with the statement that experiments are run on SCARED; clarify the train/test split and which dataset is truly unseen.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below with point-by-point responses and have revised the manuscript where appropriate to strengthen the presentation of results and ablations.
Point-by-point responses
- Referee: [Abstract] The central claims of 24.6% PSNR and 9.1% SSIM improvement on SCARED, plus zero-shot generalization, are presented without any experimental protocol, baseline details, error bars, number of runs, or ablation results, leaving the performance claims unverifiable and load-bearing for the paper's contribution.
  Authors: The experimental protocol, baselines, error bars (computed over 5 runs), and ablation results are fully detailed in Sections 4.1–4.3 and the supplementary material; the abstract serves as a concise summary of the key outcomes. To improve verifiability at the abstract level, we have revised it to briefly reference the SCARED evaluation protocol and cross-dataset zero-shot testing on EndoNeRF, balancing brevity with transparency within standard length limits. (Revision: partial)
- Referee: [Method] (DeGAT description) No ablation isolates the dynamic graph construction from the rest of the pipeline, and no graph statistics (e.g., average degree, attention entropy) or stability analysis under specular highlights and occlusions is provided; this bears directly on whether DeGAT reliably enforces global consistency without artifacts or dataset-specific tuning.
  Authors: We agree that isolating the dynamic graph construction is essential to substantiate its contribution. The revised manuscript adds a dedicated ablation in Section 4.2 that compares the full DeGAT module against a static spatial-neighborhood variant, confirming the benefit of feature-space dynamic graphs. We now report graph statistics (average degree and attention entropy) in Section 3.2 and include a stability analysis, with quantitative metrics and qualitative examples under specular highlights and occlusions, in the supplementary material, showing consistent global coherence without introduced artifacts. (Revision: yes)
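The static-versus-dynamic neighborhood distinction at the heart of this ablation can be made concrete. A toy sketch (grid size, k, and the random features are illustrative assumptions, not the paper's setup):

```python
import numpy as np

# Contrast the two neighborhood definitions the ablation separates.
rng = np.random.default_rng(1)
n_side, k = 4, 3
feats = rng.normal(size=(n_side * n_side, 8))  # one feature per patch

def static_spatial_neighbors(i):
    """4-connected neighbors on the patch grid (fixed topology)."""
    r, c = divmod(i, n_side)
    cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [rr * n_side + cc for rr, cc in cand
            if 0 <= rr < n_side and 0 <= cc < n_side]

def dynamic_feature_neighbors(i):
    """k nearest tokens by feature distance (input-dependent topology)."""
    d = np.linalg.norm(feats - feats[i], axis=1)
    d[i] = np.inf                     # exclude the token itself
    return list(np.argsort(d)[:k])

# The two graphs generally disagree: feature-space edges can link
# distant patches of the same tissue across an occluding instrument,
# which a fixed grid neighborhood cannot.
spatial = static_spatial_neighbors(5)
dynamic = dynamic_feature_neighbors(5)
```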
Circularity Check
No circularity; empirical results on external datasets are independent of any derivation
Full rationale
The paper introduces EndoVGGT with a DeGAT module that dynamically builds feature-space graphs for depth estimation. All central claims (PSNR/SSIM gains on SCARED, zero-shot generalization to EndoNeRF) are presented as outcomes of experimental evaluation on held-out datasets rather than quantities derived from equations or parameters fitted to the target metrics. No mathematical derivations, self-definitional relations, or fitted-input predictions appear in the provided text. Any self-citations are incidental and not load-bearing for the reported numbers, which remain externally falsifiable by re-running the model on the same public benchmarks. The chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Dynamic feature-space graphs can propagate structural cues across occlusions in surgical scenes.
invented entities (1)
- Deformation-aware Graph Attention (DeGAT) module (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear. Quoted passage: "DeGAT dynamically constructs feature-space semantic graphs... attention logit ℓ_{ij} = aᵀ LeakyReLU(W_proj [x_{t,i} ∥ x_{t,j}])"
- IndisputableMonolith/Foundation/AlexanderDuality.lean, theorem alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear. Quoted passage: "Proposition 1 (Stability)... ‖x^out_{t,i}‖₂ ≤ ‖x_{t,i}‖₂ + max_{j∈N(i)} ‖W_val x_{t,j}‖₂"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Allan, M., Mcleod, J., Wang, C., Rosenthal, J.C., Hu, Z., Gard, N., Eisert, P., Fu, K.X., Zeffiro, T., Xia, W., et al.: Stereo correspondence and reconstruction of endoscopic data challenge. arXiv preprint arXiv:2101.01133 (2021)
- [2] Bartholomew, R.A., Zhou, H., Boreel, M., Suresh, K., Gupta, S., Mitchell, M.B., Hong, C., Lee, S.E., Smith, T.R., Guenette, J.P., et al.: Surgical navigation in the anterior skull base using 3-dimensional endoscopy and surface reconstruction. JAMA Otolaryngology–Head & Neck Surgery 150(4), 318–326 (2024)
- [3] Bhat, S.F., Birkl, R., Wofk, D., Wonka, P., Müller, M.: ZoeDepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288 (2023)
- [4] Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289 (2015)
- [5] Cui, B., Islam, M., Bai, L., Ren, H.: Surgical-DINO: adapter learning of foundation models for depth estimation in endoscopic surgery. International Journal of Computer Assisted Radiology and Surgery 19(6), 1013–1020 (2024)
- [6] Funke, I., Mees, S.T., Weitz, J., Speidel, S.: Video-based surgical skill assessment using 3D convolutional neural networks. International Journal of Computer Assisted Radiology and Surgery 14(7), 1217–1225 (2019)
- [7] Göbel, B., Huurdeman, J., Reiterer, A., Möller, K.: Robot-based procedure for 3D reconstruction of abdominal organs using the iterative closest point and pose graph algorithms. Journal of Imaging 11(2), 44 (2025)
- [8] Han, K., Wang, Y., Guo, J., Tang, Y., Wu, E.: Vision GNN: An image is worth graph of nodes. Advances in Neural Information Processing Systems 35, 8291–8303 (2022)
- [9] He, Z., Wang, T.: OpenLRM: Open-source large reconstruction models (2023)
- [10] Hirohata, Y., Sogabe, M., Miyazaki, T., Kawase, T., Kawashima, K.: Confidence-aware self-supervised learning for dense monocular depth estimation in dynamic laparoscopic scene. Scientific Reports 13(1), 15380 (2023)
- [11] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42(4), 139 (2023)
- [12] Liu, Y., Li, C., Yang, C., Yuan, Y.: EndoGaussian: Real-time Gaussian splatting for dynamic endoscopic scene reconstruction. arXiv preprint arXiv:2401.12561 (2024)
- [13] Lu, Y., Wei, R., Li, B., Chen, W., Zhou, J., Dou, Q., Sun, D., Liu, Y.H.: Autonomous intelligent navigation for flexible endoscopy using monocular depth guidance and 3-D shape planning. arXiv preprint arXiv:2302.13219 (2023)
- [14] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99–106 (2021)
- [15] Mountney, P., Yang, G.Z.: Dynamic view expansion for minimally invasive surgery using simultaneous localization and mapping. In: 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 1184–1187. IEEE (2009)
- [16] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
- [17] Ozyoruk, K.B., Gokceler, G.I., Bobrow, T.L., Coskun, G., Incetan, K., Almalioglu, Y., Mahmood, F., Curto, E., Perdigoto, L., Oliveira, M., et al.: EndoSLAM dataset and an unsupervised monocular visual odometry and depth estimation approach for endoscopic videos. Medical Image Analysis 71, 102058 (2021)
- [18] Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A.: FiLM: Visual reasoning with a general conditioning layer. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
- [19] Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12179–12188 (2021)
- [20] Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104–4113 (2016)
- [21] Sun, C., Sun, M., Chen, H.T.: Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5459–5469 (2022)
- [22] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)
- [23] Wagner, A., Rozenblit, J.W.: Augmented reality visual guidance for spatial perception in the computer assisted surgical trainer. In: Proceedings of the Symposium on Modeling and Simulation in Medicine, pp. 1–12 (2017)
- [24] Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: VGGT: Visual geometry grounded transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 5294–5306 (2025)
- [25] Wang, X., Zhang, S., Huang, B., Stoyanov, D., Mazomenos, E.B.: EndoLRMGS: Complete endoscopic scene reconstruction combining large reconstruction modelling and Gaussian splatting. arXiv preprint arXiv:2503.22437 (2025)
- [26] Wang, Y., Long, Y., Fan, S.H., Dou, Q.: Neural rendering for stereo 3D reconstruction of deformable tissues in robotic surgery. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 431–441. Springer (2022)
- [27] Wei, R., Guo, J., Lu, Y., Zhong, F., Liu, Y., Sun, D., Dou, Q.: Scale-aware monocular reconstruction via robot kinematics and visual data in neural radiance fields. Artificial Intelligence Surgery 4(3), 187–198 (2024)
- [28] Xu, M., Guo, Z., Wang, A., Bai, L., Ren, H.: A review of 3D reconstruction techniques for deformable tissues in robotic surgery. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 157–167. Springer (2024)
- [29] Yao, Y., Luo, Z., Li, S., Fang, T., Quan, L.: MVSNet: Depth inference for unstructured multi-view stereo. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 767–783 (2018)
- [30] Zha, R., Cheng, X., Li, H., Harandi, M., Ge, Z.: EndoSurf: Neural surface reconstruction of deformable tissues with stereo endoscope videos. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 13–23. Springer (2023)
Appendix excerpts
The remaining extracted items are not external works but fragments of the paper's own appendix ("Theoretical Properties of DeGAT" and implementation details). The recoverable statements:
- Lemma 1 (row-stochasticity): for each node i, the DeGAT attention coefficients {α_ij}_j sum to 1, so aggregation is a convex combination of neighbor features.
- Proposition 2 (permutation equivariance): for a permutation π of token indices with matrix P, permuting features and coordinates consistently (X′ = PX, p′_π(i) = p_i) permutes the one-hop DeGAT output in the same way; token indexing does not matter.
- Corollary 3 (equivalent marginal penalty after eliminating confidence): for r² > 0, substituting Ĉ⋆ = α/(γr²) into J yields min_{Ĉ>0} J(Ĉ) = α − α log(α/(γr²)) = α log(γr²) + const, where the constant is independent of the model outputs. The proof is direct substitution: J(Ĉ⋆) = γr² · α/(γr²) − α log(α/(γr²)).
- Backward propagation of the graph layer: with ∇x_agg,i = ∂L/∂x_agg,i, the value-matrix gradient is ∇W_val ← Σ_{b,i} Σ_{j∈N(i)} ∇x_agg,i · α_ij ⊗ x_j, and the attention parameters {a, W_proj} are updated via δ_ij = ∇ᵀx_agg,i v_j.
- Attention-level DeGAT: semantic features F ∈ R^{B×N×C} are patch tokens from a frozen DINOv2 backbone; queries Q and keys K ∈ R^{B×H×N×d_head} come from the VGGT frame attention (the MLP bias design references Raffel et al., T5).
- Semantic distance: the pairwise Euclidean matrix D ∈ R^{B×N×N} has entries d_ij = ‖f_i − f_j‖₂ = sqrt(Σ_c (f_i,c − f_j,c)²); smaller values indicate higher semantic similarity. The continuous variant abandons spatial coordinates entirely in favor of these feature-space distances.
- Distance processing: a logarithmic transform d̃_ij = log(d_ij + 1) damps outliers while preserving resolution among similar patches; distances are then normalized by the maximum within the current view for scale invariance and, in the discrete variant, quantized to an index Idx_ij.
- Discrete bias lookup: per head h, a learnable scalar bias b^(h)_ij = b[Idx_ij]_h lets the model learn distinct attention bonuses or penalties for different similarity levels.
- Continuous bias generation: a lightweight MLP Ψ maps the normalized distance to a head-specific bias, B_ij = Ψ(x_ij) = W₂ ReLU(W₁ x_ij + b₁) + b₂, with W₁ ∈ R^{M×1}, W₂ ∈ R^{H×M} and learnable biases b₁, b₂, enabling end-to-end gradient flow.
- Injection into attention: the bias is added to the logits of the VGGT frame attention, Attention^(h)_ij = Softmax(q_i · k_jᵀ / √d_head + b^(h)_ij).
- Backward propagation of the bias: with δ^(h)_ij = ∂L/∂b^(h)_ij from the standard softmax backward pass, the embedding table (or the MLP parameters, via the hidden activation z_ij = ReLU(W₁ x_ij + b₁)) is updated by aggregating gradients over patch pairs.
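The appendix's continuous semantic-bias mechanism (pairwise feature distance, log-space normalization, a small MLP producing a per-head bias, injection into the attention logits) can be sketched for a single head as follows; all shapes and initializations are illustrative assumptions, not the authors' code:

```python
import numpy as np

def semantic_bias_attention(f, q, k_mat, w1, b1, w2, b2):
    """Single-head sketch of continuous semantic-bias attention.

    f: (N, C) patch-token features; q, k_mat: (N, d) queries/keys.
    w1, w2: (M,) MLP weights for one head; b1, b2: scalar biases.
    """
    n, d = q.shape
    # 1. Pairwise semantic distance d_ij = ||f_i - f_j||_2.
    dist = np.linalg.norm(f[:, None, :] - f[None, :, :], axis=-1)
    # 2. Log transform, then normalize by the per-view maximum.
    x = np.log(dist + 1.0)
    x = x / x.max()
    # 3. Lightweight MLP maps each scalar distance to a bias:
    #    B_ij = W2 ReLU(W1 x_ij + b1) + b2.
    hidden = np.maximum(w1 * x[..., None] + b1, 0.0)   # (N, N, M)
    bias = hidden @ w2 + b2                            # (N, N)
    # 4. Inject into the attention logits before the softmax.
    logits = q @ k_mat.T / np.sqrt(d) + bias
    a = np.exp(logits - logits.max(axis=1, keepdims=True))
    return a / a.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
n, c, d, m = 6, 8, 4, 5
attn = semantic_bias_attention(
    rng.normal(size=(n, c)), rng.normal(size=(n, d)),
    rng.normal(size=(n, d)), rng.normal(size=m), 0.0,
    rng.normal(size=m), 0.0)
```

Because the bias is a smooth function of a continuous coordinate, gradients flow end to end through the MLP, which is the stated advantage over the quantized lookup-table variant.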
- FiLM-based camera-token modulation: a Feature-wise Linear Modulation mechanism [18] conditions the camera token on image content, letting it adapt dynamically to the input frames while preserving training stability (the "cls + FiLM" modulation); evaluated sequences include "cutting", which depicts tissue excision with topological changes, and "pulling".
- Evaluation metrics: reconstruction quality is measured with PSNR, SSIM, and LPIPS; higher is better for PSNR and SSIM, lower for LPIPS. Table 3 reports results on the EndoNeRF and SCARED datasets (e.g., EndoNeRF-pulling, VGGT baseline: PSNR 23.349, SSIM 0.659, LPIPS 0.396).
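The PSNR and SSIM metrics used in these tables follow standard definitions; a minimal sketch is below. The SSIM here is a single-window (global) variant, whereas published evaluations typically use a sliding Gaussian window, so treat it as an approximation:

```python
import numpy as np

def psnr(ref, test, max_val=1.0):
    """Peak signal-to-noise ratio: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((ref - test) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(x, y, max_val=1.0):
    """Global (single-window) SSIM with the standard constants
    C1 = (0.01 L)^2 and C2 = (0.03 L)^2."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(3)
img = rng.random((32, 32))
noisy = np.clip(img + 0.05 * rng.standard_normal((32, 32)), 0, 1)
p = psnr(img, noisy)      # degrades as noise grows
s = ssim_global(img, img) # identical images score exactly 1.0
```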