NAIMA: Semantics Aware RGB Guided Depth Super-Resolution
Pith reviewed 2026-05-10 20:09 UTC · model grok-4.3
The pith
Global semantic priors from pretrained vision transformers improve guided depth super-resolution by correcting misleading RGB texture cues.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce the NAIMA architecture, which combines a pretrained vision transformer with Guided Token Attention blocks. These blocks use cross-attention to align RGB spatial features with depth encodings while selectively injecting semantic context from different layers of the transformer. According to the authors, this semantics-aware approach yields significant gains over prior methods across multiple scaling factors and datasets.
What carries the argument
The Guided Token Attention module, which performs iterative cross-attention between RGB and depth features to incorporate global semantic priors from a pretrained vision transformer.
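The mechanism described above can be illustrated with a minimal sketch of iterative cross-attention. This is not the authors' implementation: it omits learned projections, multi-head structure, and normalization, and the choice of depth features as queries, the residual updates, and the function names (`cross_attention`, `guided_token_attention_sketch`) are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    queries: list of d-dimensional vectors from one stream.
    keys, values: equal-length lists of vectors from the other stream.
    Returns one attended vector per query.
    """
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended


def add(xs, ys):
    """Element-wise residual sum of two lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def guided_token_attention_sketch(depth_feats, rgb_feats, semantic_tokens, steps=2):
    """Assumed iteration order: depth queries RGB, then the semantic tokens."""
    x = depth_feats
    for _ in range(steps):
        x = add(x, cross_attention(x, rgb_feats, rgb_feats))
        x = add(x, cross_attention(x, semantic_tokens, semantic_tokens))
    return x
```

In this sketch all three streams share one feature dimension so that the residual sums are well defined; a real module would project each stream to a common width first.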
Load-bearing premise
The global semantic priors from the pretrained vision transformer reliably disambiguate misleading local RGB cues in depth discontinuities without creating new errors.
What would settle it
Evaluate the method on images where semantic labels conflict with the actual depth structure, such as a flat textured wall with varying colors but uniform depth, and check whether performance degrades relative to non-semantic methods.
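A synthetic version of the flat-textured-wall test could be set up along the following lines. This is a sketch under stated assumptions: the guide image is reduced to a single-channel checkerboard, the depth is constant, and `texture_leaking_predictor` is a hypothetical stand-in for a model that copies guide texture into its output. None of these names or values come from the paper.

```python
import math


def make_textured_wall(height, width, tile=4, depth_value=2.0):
    """Checkerboard intensity image over a wall of constant depth."""
    intensity = [
        [1.0 if (r // tile + c // tile) % 2 == 0 else 0.0 for c in range(width)]
        for r in range(height)
    ]
    depth = [[depth_value] * width for _ in range(height)]
    return intensity, depth


def rmse(pred, target):
    """Root-mean-square error between two equally sized 2D maps."""
    squared = [
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    ]
    return math.sqrt(sum(squared) / len(squared))


def texture_leaking_predictor(intensity, depth_value, leak=0.05):
    """Hypothetical failure mode: guide texture leaks into the depth output."""
    return [[depth_value + leak * (i - 0.5) for i in row] for row in intensity]


intensity, depth_gt = make_textured_wall(8, 8)
leaky = texture_leaking_predictor(intensity, 2.0)
print(rmse(depth_gt, depth_gt))  # 0.0
print(rmse(leaky, depth_gt))     # approximately 0.025
```

Any texture-induced error on such a scene is attributable to the guide alone, because the ground-truth depth contains no structure to recover.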
Original abstract
Guided depth super-resolution (GDSR) is a multi-modal approach for depth map super-resolution that relies on a low-resolution depth map and a high-resolution RGB image to restore finer structural details. However, the misleading color and texture cues indicating depth discontinuities in RGB images often lead to artifacts and blurred depth boundaries in the generated depth map. We propose a solution that introduces global contextual semantic priors, generated from pretrained vision transformer token embeddings. Our approach to distilling semantic knowledge from pretrained token embeddings is motivated by their demonstrated effectiveness in related monocular depth estimation tasks. We introduce a Guided Token Attention (GTA) module, which iteratively aligns encoded RGB spatial features with depth encodings, using cross-attention for selectively injecting global semantic context extracted from different layers of a pretrained vision transformer. Additionally, we present an architecture called Neural Attention for Implicit Multi-token Alignment (NAIMA), which integrates DINOv2 with GTA blocks for a semantics-aware GDSR. Our proposed architecture, with its ability to distill semantic knowledge, achieves significant improvements over existing methods across multiple scaling factors and datasets.
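The GDSR input setting described in the abstract can be sketched as follows: a high-resolution depth map is degraded to a low-resolution one, and a guide-free baseline upsamples it again. The degradation (average pooling) and the baseline (nearest-neighbour upsampling) are assumptions chosen for simplicity; the abstract does not state which degradation the paper uses.

```python
def average_pool(depth, factor):
    """Downsample a 2D depth map by averaging factor x factor blocks.

    Assumes height and width are divisible by factor.
    """
    height, width = len(depth), len(depth[0])
    return [
        [
            sum(depth[r + dr][c + dc] for dr in range(factor) for dc in range(factor))
            / factor ** 2
            for c in range(0, width, factor)
        ]
        for r in range(0, height, factor)
    ]


def nearest_upsample(depth_lr, factor):
    """Upsample by repeating each low-resolution value factor x factor times."""
    return [
        [depth_lr[r // factor][c // factor] for c in range(len(depth_lr[0]) * factor)]
        for r in range(len(depth_lr) * factor)
    ]


# A depth edge that does not align with the pooling grid gets smeared.
depth_hr = [[1.0, 3.0, 3.0, 3.0] for _ in range(4)]
depth_lr = average_pool(depth_hr, 2)
print(depth_lr)                          # [[2.0, 3.0], [2.0, 3.0]]
print(nearest_upsample(depth_lr, 2)[0])  # [2.0, 2.0, 3.0, 3.0]
```

The restored row no longer contains the original edge between the first and second columns, which is the detail a guided method tries to recover from the RGB image.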
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes NAIMA, a semantics-aware architecture for guided depth super-resolution (GDSR). It extracts global contextual semantic priors from frozen pretrained DINOv2 vision transformer token embeddings and injects them into RGB and depth encoders via a Guided Token Attention (GTA) module that performs iterative cross-attention alignment. The central claim is that this distillation of semantic knowledge corrects misleading RGB texture cues, yielding significant quantitative improvements over prior GDSR methods across multiple upscaling factors and datasets.
Significance. If the reported gains are reproducible and demonstrably attributable to the semantic priors rather than added architectural capacity, the work would offer a parameter-efficient route to incorporating high-level context from foundation models into low-level multi-modal restoration tasks. This would be of moderate interest to the GDSR and guided super-resolution community, particularly if accompanied by evidence that the frozen DINOv2 tokens reliably encode depth discontinuities.
major comments (3)
- [§3] §3 (GTA module description): The construction assumes that global DINOv2 token embeddings, extracted from a model pretrained with self-supervision on natural images, will selectively correct depth-boundary errors induced by RGB texture without introducing new artifacts. No analysis, visualization, or boundary-specific metric is provided to verify that the cross-attention alignment targets depth discontinuities rather than color correlations; the absence of a boundary-aware loss or explicit alignment regularizer leaves this assumption untested.
- [§4] §4 (Experiments): The claim of 'significant improvements across multiple scaling factors and datasets' is presented without accompanying quantitative tables, per-method error maps, or ablation studies that isolate the contribution of the DINOv2 priors from the added GTA blocks. Without these, it is impossible to determine whether the gains stem from semantic distillation or simply from increased model capacity.
- [§3.2] §3.2 (Architecture integration): The manuscript does not state whether the DINOv2 backbone remains entirely frozen or receives any task-specific adaptation. If frozen, the lack of any mechanism to ensure that the injected tokens respect depth geometry (as opposed to propagating texture-induced errors) constitutes a load-bearing gap for the central claim.
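A boundary-specific metric of the kind requested in the first major comment could look like the sketch below. It is one possible definition, not one taken from the manuscript: boundaries are detected by thresholding depth differences to the right and lower neighbours, and the F1 score uses exact pixel matches with no spatial tolerance. The function names and the threshold value are assumptions.

```python
def boundary_mask(depth, threshold):
    """Mark pixels whose right or lower neighbour differs by more than threshold."""
    height, width = len(depth), len(depth[0])
    mask = [[False] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            right = c + 1 < width and abs(depth[r][c] - depth[r][c + 1]) > threshold
            down = r + 1 < height and abs(depth[r][c] - depth[r + 1][c]) > threshold
            mask[r][c] = right or down
    return mask


def boundary_f1(pred, target, threshold=0.5):
    """F1 score between predicted and ground-truth boundary masks."""
    pred_mask = boundary_mask(pred, threshold)
    target_mask = boundary_mask(target, threshold)
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred_mask, target_mask):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    if tp == 0:
        return 0.0  # no correctly placed boundary pixels
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Exact matching penalizes a boundary that is displaced by a single pixel as heavily as a missing one; a tolerance radius would be a natural refinement.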
minor comments (2)
- [Abstract] The abstract repeats the phrase 'distilling semantic knowledge' multiple times; a single concise statement would improve readability.
- [§3] Notation for the GTA cross-attention (e.g., query/key/value definitions) should be introduced explicitly with equations rather than prose only.
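For reference, the standard scaled dot-product form that such notation would most likely follow is given below. The symbols are assumptions introduced here, not the paper's: $F_d$ denotes depth encodings used as queries, $F_g$ denotes the guiding features (RGB features or vision transformer tokens) used as keys and values, and $W_Q$, $W_K$, $W_V$ are learned projections with key dimension $d_k$.

```latex
Q = F_d W_Q, \qquad K = F_g W_K, \qquad V = F_g W_V,
\qquad
\mathrm{CrossAttn}(F_d, F_g) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

Whether the manuscript assigns queries to the depth stream or to the RGB stream is exactly the kind of detail explicit equations would settle.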
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We appreciate the opportunity to address the concerns regarding the GTA module, experimental validation, and architectural details. We will revise the paper accordingly to strengthen the presentation and evidence for our claims.
Point-by-point responses
-
Referee: [§3] §3 (GTA module description): The construction assumes that global DINOv2 token embeddings, extracted from a model pretrained with self-supervision on natural images, will selectively correct depth-boundary errors induced by RGB texture without introducing new artifacts. No analysis, visualization, or boundary-specific metric is provided to verify that the cross-attention alignment targets depth discontinuities rather than color correlations; the absence of a boundary-aware loss or explicit alignment regularizer leaves this assumption untested.
Authors: We agree that the current manuscript would benefit from explicit verification of the cross-attention behavior. In the revised version, we will add visualizations of GTA attention maps overlaid on depth boundaries and introduce a boundary-specific evaluation metric (e.g., edge accuracy or boundary F1-score) to quantify improvements at discontinuities. While we did not add a dedicated boundary-aware loss—preferring to rely on the semantic priors from DINOv2 and the iterative cross-attention design—we will expand the discussion in §3 to explain how the global token context is intended to override local texture cues. This addresses the untested assumption without altering the core method. revision: yes
-
Referee: [§4] §4 (Experiments): The claim of 'significant improvements across multiple scaling factors and datasets' is presented without accompanying quantitative tables, per-method error maps, or ablation studies that isolate the contribution of the DINOv2 priors from the added GTA blocks. Without these, it is impossible to determine whether the gains stem from semantic distillation or simply from increased model capacity.
Authors: The manuscript contains quantitative results across datasets and scales, but we acknowledge that the presentation lacks sufficient detail in the form of comprehensive tables, error maps, and targeted ablations. In the revision, we will include full per-method error maps, expanded tables reporting all metrics, and a dedicated ablation study that replaces DINOv2 tokens with random or null priors while keeping GTA blocks fixed. This will isolate the semantic contribution from any capacity increase and directly support the central claim. revision: yes
-
Referee: [§3.2] §3.2 (Architecture integration): The manuscript does not state whether the DINOv2 backbone remains entirely frozen or receives any task-specific adaptation. If frozen, the lack of any mechanism to ensure that the injected tokens respect depth geometry (as opposed to propagating texture-induced errors) constitutes a load-bearing gap for the central claim.
Authors: The DINOv2 backbone is entirely frozen, as indicated in the architecture description. We will clarify this explicitly in the revised §3.2 and add a short discussion of how the GTA cross-attention mechanism—conditioned on depth encoder features—aligns the semantic tokens to depth geometry. The iterative alignment process is designed to prioritize structural consistency over RGB texture correlations, thereby reducing the risk of propagating misleading cues. revision: yes
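The ablation promised in the second response, replacing DINOv2 tokens with random or null priors while keeping the GTA blocks fixed, could be wired up with a small helper like the one below. This is a hypothetical sketch: the function name, the mode strings, and the choice of unit-variance Gaussian noise for the random prior are assumptions, and real token embeddings would be tensors rather than nested lists.

```python
import random


def make_prior_tokens(real_tokens, mode, seed=0):
    """Return the token set to inject for a given ablation mode.

    real_tokens: list of token vectors from the pretrained backbone.
    mode: "real" keeps them, "null" zeroes them, "random" draws Gaussian noise.
    The output always has the same shape, so downstream capacity is unchanged.
    """
    num_tokens, dim = len(real_tokens), len(real_tokens[0])
    if mode == "real":
        return [list(token) for token in real_tokens]
    if mode == "null":
        return [[0.0] * dim for _ in range(num_tokens)]
    if mode == "random":
        rng = random.Random(seed)  # seeded so the ablation is reproducible
        return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]
    raise ValueError(f"unknown ablation mode: {mode}")
```

Because every mode yields tokens of identical shape, any performance gap between "real" and the other two modes cannot be explained by parameter count alone.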
Circularity Check
No significant circularity in NAIMA architecture proposal
The paper introduces an empirical architecture (NAIMA) that fuses frozen DINOv2 token embeddings via GTA cross-attention blocks into RGB/depth encoders for guided depth super-resolution. No equations, derivations, or parameter-fitting steps are described that would reduce a claimed prediction back to the input by construction. The semantic priors originate from an external pretrained model (DINOv2) whose training objective is independent of the present GDSR task, and performance improvements are asserted via experimental comparison rather than self-referential fitting or renamed known results. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided abstract or described method.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We introduce a Guided Token Attention (GTA) module, which iteratively aligns encoded RGB spatial features with depth encodings, using cross-attention for selectively injecting global semantic context extracted from different layers of a pretrained vision transformer."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Our proposed architecture, with its ability to distill semantic knowledge, achieves significant improvements over existing methods across multiple scaling factors and datasets."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.