Mono-Hydra++: Real-Time Monocular Scene Graph Construction with Multi-Task Learning for 3D Indoor Mapping

Francesco Nex; George Vosselman; U. V. B. L. Udugama

arxiv: 2605.17661 · v1 · pith:N27MOJ4Xnew · submitted 2026-05-17 · 💻 cs.RO · cs.CV

Pith reviewed 2026-05-20 12:02 UTC · model grok-4.3

keywords: monocular SLAM, scene graph, visual inertial odometry, multi-task learning, semantic mapping, indoor 3D mapping, real-time robotics

The pith

Mono-Hydra++ builds real-time 3D scene graphs from monocular RGB and IMU data alone while matching or beating RGB-D trajectory accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Mono-Hydra++, a pipeline that constructs hierarchical 3D scene graphs and metric semantic maps using only monocular RGB images plus IMU measurements. It fuses a multi-task neural network that predicts depth and semantics with a visual-inertial odometry front end, then applies sparse depth constraints, semantic masking, and volumetric fusion in the backend. This setup targets agile robots that cannot carry active depth sensors yet still need object-level and room-level understanding for exploration and interaction tasks. The authors report 1.6 percent lower average trajectory error than the strongest RGB-D baseline on the Go-SLAM ScanNet subset and a 29.8 percent improvement in average ATE over the strongest competing calibrated baseline on 7-Scenes, plus real-time embedded deployment.
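To make the phrase "hierarchical 3D scene graph" concrete, the following is a minimal sketch of a layered graph with object and room nodes plus a radius-based object query of the kind Figure 11 alludes to. The class names, layer names, and fields are illustrative assumptions and are not taken from the Mono-Hydra++ or Hydra code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    layer: str        # illustrative layer names: "object" or "room"
    position: tuple   # (x, y, z) centroid in metres, map frame
    label: str = ""


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    parent: dict = field(default_factory=dict)  # child id -> parent id

    def add(self, node, parent_id=None):
        self.nodes[node.node_id] = node
        if parent_id is not None:
            self.parent[node.node_id] = parent_id

    def children(self, node_id):
        return sorted(c for c, p in self.parent.items() if p == node_id)

    def objects_within(self, center, radius):
        """Return ids of object nodes whose centroid lies within `radius` of `center`."""
        hits = []
        for n in self.nodes.values():
            if n.layer != "object":
                continue
            dist = sum((a - b) ** 2 for a, b in zip(n.position, center)) ** 0.5
            if dist <= radius:
                hits.append(n.node_id)
        return sorted(hits)


# Hypothetical toy scene: one room containing two objects.
graph = SceneGraph()
graph.add(Node("room_0", "room", (2.0, 2.0, 0.0), "office"))
graph.add(Node("chair_0", "object", (1.0, 1.5, 0.4), "chair"), parent_id="room_0")
graph.add(Node("desk_0", "object", (3.5, 2.0, 0.7), "desk"), parent_id="room_0")

print(graph.children("room_0"))
print(graph.objects_within((1.0, 1.0, 0.0), 1.0))
```

The parent map is what gives the graph its hierarchy: a room-level query reduces to a lookup, while the metric positions stay available for geometric queries.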

Core claim

Mono-Hydra++ combines a DINOv3-based multi-task model for depth and semantics with deep-feature VIO, sparse predicted depth constraints in the pose graph, semantic masking for dynamic regions, and pose-aware temporal alignment before volumetric fusion. The paper claims that the resulting real-time monocular metric semantic maps and 3D scene graphs achieve 1.6 percent lower average trajectory error than the strongest RGB-D baseline on the Go-SLAM ScanNet evaluation subset and a 29.8 percent improvement over the strongest competing calibrated baseline on 7-Scenes.

What carries the argument

M2H-MX multi-task model supplying depth and semantic predictions that serve as sparse constraints inside the VIO-derived pose graph.

Load-bearing premise

The depth and semantic predictions from the multi-task model remain accurate enough to act as useful constraints in the pose graph without increasing overall trajectory error.
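One way to picture how a predicted depth can act as a constraint is as a whitened residual between the network's depth for a pixel and the z-coordinate of the matching landmark in the camera frame. The sketch below is a generic formulation under that assumption; the function name, the camera-to-world pose convention, and the noise value `sigma` are hypothetical, and the paper's actual factor may differ.

```python
def predicted_depth_residual(R_wc, t_wc, landmark_w, depth_pred, sigma):
    """Whitened depth residual for one landmark.

    R_wc, t_wc: camera-to-world rotation (3x3 nested list) and translation.
    landmark_w: landmark position in the world frame.
    depth_pred: network depth at the landmark's pixel, in metres.
    sigma:      assumed standard deviation of the depth prediction, in metres.
    """
    # Transform the landmark into the camera frame: p_c = R_wc^T (p_w - t_wc).
    diff = [landmark_w[i] - t_wc[i] for i in range(3)]
    p_c = [sum(R_wc[r][c] * diff[r] for r in range(3)) for c in range(3)]
    return (depth_pred - p_c[2]) / sigma


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Camera whose optical (z) axis points along world x.
LOOK_ALONG_X = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]

# Camera 1 m along z, landmark 4 m along z: true depth is 3 m, prediction is 3.1 m.
r_biased = predicted_depth_residual(IDENTITY, (0.0, 0.0, 1.0), (0.0, 0.0, 4.0), 3.1, 0.1)
# Landmark 2 m in front of the rotated camera, prediction exactly right.
r_exact = predicted_depth_residual(LOOK_ALONG_X, (0.0, 0.0, 0.0), (2.0, 0.0, 0.0), 2.0, 0.1)
print(r_biased, r_exact)
```

A residual near zero means the prediction agrees with the geometry implied by the current pose estimate; a consistently signed residual across many landmarks would point to the scale bias the premise worries about.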

What would settle it

A measurement on the Go-SLAM ScanNet subset showing average trajectory error higher than the strongest RGB-D baseline would indicate that the added depth constraints do not deliver the claimed accuracy benefit.
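The settling test comes down to comparing two trajectory error figures. As a minimal sketch, the code below computes a trajectory error as the RMSE of translational differences and then the relative reduction against a baseline. It assumes the two trajectories are already time-associated and aligned, that RMSE is the convention in use (the paper's exact protocol is not stated here), and all numbers are hypothetical placeholders rather than values from the paper.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of translational error between matched, pre-aligned positions."""
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and the same length")
    squared = [
        sum((e - g) ** 2 for e, g in zip(p_est, p_gt))
        for p_est, p_gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction_percent(baseline_error, candidate_error):
    """Positive when the candidate has lower error than the baseline."""
    return 100.0 * (baseline_error - candidate_error) / baseline_error


# Hypothetical three-pose trajectory, positions in metres.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
est = [(0.0, 0.0, 0.0), (1.0, 0.03, 0.0), (2.0, 0.0, 0.04)]
error = ate_rmse(est, gt)

# Hypothetical errors chosen so that the reduction equals 1.6 percent.
reduction = relative_reduction_percent(10.0, 9.84)
print(error, reduction)
```

A negative value from `relative_reduction_percent` on the Go-SLAM ScanNet subset would correspond to the outcome described above, in which the monocular system is worse than the RGB-D baseline.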

Figures

Figures reproduced from arXiv: 2605.17661 by Francesco Nex, George Vosselman, U. V. B. L. Udugama.

**Figure 2.** System overview of Mono-Hydra++. Monocular RGB and IMU streams …

**Figure 3.** M2H-MX architecture overview. An input RGB image is first processed by the …

**Figure 4.** Internal decoder modules of M2H-MX. Left: Register-Gated Mamba (RGM) …

**Figure 5.** BinDepthHead mapping $\tilde{h}_d$ to $\hat{D}$ with adaptive bins and residual refinement. (Panel labels: mixed map $\tilde{h}_s$, Conv head 3 × 3 + 1 × 1, logits $\hat{S}$.)

**Figure 7.** Depth factors with motion gating. Depth samples from M2H-MX are attached to …

**Figure 8.** Depth factors in the VIO factor graph. IMU factors link consecutive poses, …

**Figure 9.** Qualitative comparison on ScanNet scene0054. RGB-D results are on the top …

**Figure 10.** Qualitative semantic mesh comparison on two representative ScanNet scenes.

**Figure 11.** Qualitative radius-based object retrieval at …

**Figure 12.** Trajectory and mesh overview for the uHumans2 ablation. The top row sum…

**Figure 13.** Qualitative uHumans2 Office H12 ablation example. The top row shows the …

**Figure 14.** Loop-closure candidate diagnostic on the highly dynamic uHumans2 Office H12 …

**Figure 15.** ITC 2nd Floor full-loop reconstructions used in the real-world mapping test.

Original abstract

Autonomous agile robots need more than metric geometry: they must understand objects, rooms, places, and spatial relations for search, inspection, exploration, and human-robot interaction. Conventional metric maps support localization and collision avoidance, but do not provide this semantic and relational structure. 3D scene graphs address this gap by connecting geometry with object-level and room-level understanding. Building such representations on agile platforms remains difficult because aerial and lightweight robots operate under strict payload, power, and compute limits, making RGB-D cameras and LiDAR sensors impractical for many onboard settings. We present Mono-Hydra++, a real-time monocular RGB plus IMU pipeline for indoor metric semantic mapping and hierarchical 3D scene graph construction. The system combines M2H-MX, a DINOv3-based multi-task model for depth and semantics, with a deep-feature visual-inertial odometry front end, sparse predicted depth constraints in the VIO-derived pose graph, semantic masking for dynamic regions, and pose-aware temporal alignment before volumetric fusion in the Mono-Hydra backend. On the Go-SLAM ScanNet evaluation subset, Mono-Hydra++ achieves 1.6% lower average trajectory error than the strongest RGB-D baseline in our comparison, while using only monocular RGB plus IMU input. On calibrated 7-Scenes, it improves average ATE by 29.8% over the strongest competing calibrated baseline. We further validate Mono-Hydra++ in a real ITC building deployment using RealSense RGB plus IMU and demonstrate embedded feasibility by deploying the ONNX/TensorRT FP16 M2H-MX-L perception model at 25.53 FPS on a Jetson Orin NX 16GB. These results show that Mono-Hydra++ can provide real-time metric semantic mapping and scene graph construction for resource-constrained robotic platforms without relying on active depth sensors.
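The reported 25.53 FPS figure translates directly into a per-frame time budget for the perception model. The short sketch below works through that arithmetic and shows how a frame stride could be chosen when the camera runs faster than the model. The camera rates are hypothetical and the helper names are illustrative; the abstract does not state the camera rate or any frame-skipping policy.

```python
import math


def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def frame_stride(camera_fps, model_fps):
    """Process every n-th camera frame so the model never has to exceed its rate."""
    return max(1, math.ceil(camera_fps / model_fps))


reported_fps = 25.53                                # M2H-MX-L rate from the abstract
budget = frame_budget_ms(reported_fps)              # roughly 39.17 ms per frame
stride_at_30hz = frame_stride(30.0, reported_fps)   # 30 Hz camera is hypothetical
stride_at_20hz = frame_stride(20.0, reported_fps)   # 20 Hz camera is hypothetical
print(round(budget, 2), stride_at_30hz, stride_at_20hz)
```

Under these assumptions a 20 Hz stream could be processed frame by frame, while a 30 Hz stream would need every second frame dropped or a faster model.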

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Mono-Hydra++ shows a monocular RGB+IMU pipeline can match or beat some RGB-D baselines on indoor ATE while adding scene graphs, but the depth constraints' role is not isolated.

Hi, the main thing to know is that this paper builds a full monocular pipeline that reports lower average trajectory error than some RGB-D baselines on ScanNet and 7-Scenes subsets while also outputting hierarchical scene graphs in real time. It does this on embedded hardware and in a real building test, which matters for payload-limited platforms.

The integration itself is the practical advance: a DINOv3-based multi-task model supplies depth and semantics, sparse predicted depths are added as constraints in the VIO pose graph, semantic masking handles dynamics, and pose-aware alignment feeds a volumetric fusion backend. The Jetson Orin NX deployment of the perception model at 25.53 FPS and the ITC building run give concrete evidence that the system runs under tight limits without active depth sensors. The trajectory error numbers are the clearest quantitative result.

The stress-test concern holds up on the material provided. The abstract and description claim the depth predictions act as useful constraints, yet there are no ablations that turn them on and off, no reported depth accuracy figures on the test subsets, and no residual comparisons in the pose graph. Without that isolation it is hard to tell whether the 1.6% ATE edge comes from the depth output or from the deep-feature front-end and alignment steps. If the monocular depths carry typical indoor scale bias, the optimizer could down-weight them anyway, leaving them with little effect. This is a minor but real gap for a claim that hinges on the multi-task model.

The work is aimed at roboticists who need semantic maps on agile indoor platforms that cannot carry RGB-D or LiDAR. Readers building or evaluating lightweight mapping stacks would get usable pipeline details and hardware numbers. It has enough real deployment data and dataset comparisons to deserve a serious referee rather than a desk reject, even if revisions will likely need the missing ablations.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Mono-Hydra++, a real-time monocular RGB+IMU pipeline for indoor metric semantic mapping and hierarchical 3D scene graph construction. It integrates M2H-MX, a DINOv3-based multi-task model for depth and semantics, with a deep-feature VIO front-end, sparse predicted depth constraints in the VIO pose graph, semantic masking for dynamic regions, and pose-aware temporal alignment prior to volumetric fusion in the Mono-Hydra backend. The central claims are 1.6% lower average trajectory error than the strongest RGB-D baseline on the Go-SLAM ScanNet evaluation subset and 29.8% ATE improvement over the strongest competing calibrated baseline on 7-Scenes, plus real-time embedded feasibility at 25.53 FPS on Jetson Orin NX and validation in a real ITC building deployment.
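The semantic masking step in the summary can be illustrated with a minimal sketch: depth pixels whose predicted label belongs to a dynamic class are invalidated so they cannot contribute samples downstream. The choice of "person" as the only dynamic class, the use of `None` as the invalid marker, and the function names are assumptions for illustration, not details from the manuscript.

```python
DYNAMIC_LABELS = {"person"}  # assumed set of dynamic classes


def mask_dynamic(depth, labels, dynamic=DYNAMIC_LABELS):
    """Return a copy of `depth` with pixels of dynamic classes set to None."""
    return [
        [None if lab in dynamic else d for d, lab in zip(d_row, l_row)]
        for d_row, l_row in zip(depth, labels)
    ]


def valid_samples(masked):
    """List (row, col, depth) for every pixel that survived masking."""
    return [
        (r, c, d)
        for r, row in enumerate(masked)
        for c, d in enumerate(row)
        if d is not None
    ]


# Hypothetical 2x2 depth map (metres) and per-pixel semantic labels.
depth = [[1.0, 2.0], [3.0, 4.0]]
labels = [["wall", "person"], ["floor", "person"]]

masked = mask_dynamic(depth, labels)
samples = valid_samples(masked)
print(masked, samples)
```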

Significance. If the results hold, the work would be significant for enabling semantic and relational scene understanding on payload- and power-constrained agile robots without active depth sensors. Credit is given for the embedded ONNX/TensorRT deployment results and the real-world building validation, which directly address practical constraints in robotics. The approach bridges metric VIO with multi-task perception for scene graphs, which is a relevant direction for indoor mapping applications.

major comments (2)

Abstract: The headline claim of 1.6% lower ATE than the strongest RGB-D baseline on Go-SLAM ScanNet (and 29.8% on 7-Scenes) depends on sparse depth predictions from M2H-MX serving as useful constraints in the VIO-derived pose graph. No ablation isolating the depth constraints' contribution, no depth error statistics on the evaluation subsets, and no pose-graph residual comparisons with versus without the predictions are provided, so it remains unclear whether the observed gains originate from the depth output, semantic masking, temporal alignment, or the deep-feature VIO front-end.
Evaluation section: The weakest assumption—that DINOv3-based depth predictions are accurate enough to act as useful constraints without introducing scale or bias errors that degrade trajectory accuracy—is not directly tested. Systematic monocular depth errors typical in indoor scenes could be down-weighted to near zero or actively harm the optimization; without reported depth metrics or controlled experiments on the ScanNet and 7-Scenes subsets, the source of the ATE improvements cannot be verified.
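The concern about scale bias and down-weighting can be made concrete with two small checks. The sketch below estimates a global scale factor as the median ratio between reference depths and predicted depths, and shows how a Huber weight shrinks the influence of large residuals. Both are standard techniques used here for illustration; nothing in the reviewed text confirms that the authors use a Huber kernel or median scaling, and the depth values are hypothetical.

```python
import statistics


def median_scale(pred_depths, ref_depths):
    """Median of ref/pred over valid pairs; 1.0 means no global scale bias."""
    ratios = [r / p for p, r in zip(pred_depths, ref_depths) if p > 0 and r > 0]
    return statistics.median(ratios)  # raises StatisticsError if no valid pairs


def huber_weight(residual, delta):
    """Weight applied by a Huber kernel: 1 inside delta, delta/|r| outside."""
    magnitude = abs(residual)
    return 1.0 if magnitude <= delta else delta / magnitude


# Hypothetical predictions that are uniformly about 10 percent too short.
pred = [2.0, 4.0, 1.0]
ref = [2.2, 4.4, 1.1]
scale = median_scale(pred, ref)

w_small = huber_weight(0.5, 1.0)   # inlier keeps full weight
w_large = huber_weight(4.0, 1.0)   # large residual is down-weighted to 0.25
print(scale, w_small, w_large)
```

If most depth residuals in a run ended up with weights far below one, the depth terms would contribute little to the optimization, which is the scenario the referee asks the authors to rule out.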

minor comments (1)

Abstract: The system description would benefit from explicit cross-references to the full manuscript sections describing the M2H-MX architecture, the exact formulation of the sparse depth constraints in the pose graph, and the volumetric fusion backend.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the validation of our depth constraint contributions. We address each major comment below and will incorporate the suggested analyses into the revised manuscript to strengthen the evaluation.

Referee: Abstract: The headline claim of 1.6% lower ATE than the strongest RGB-D baseline on Go-SLAM ScanNet (and 29.8% on 7-Scenes) depends on sparse depth predictions from M2H-MX serving as useful constraints in the VIO-derived pose graph. No ablation isolating the depth constraints' contribution, no depth error statistics on the evaluation subsets, and no pose-graph residual comparisons with versus without the predictions are provided, so it remains unclear whether the observed gains originate from the depth output, semantic masking, temporal alignment, or the deep-feature VIO front-end.

Authors: We agree that isolating the contribution of the sparse depth constraints is important for substantiating the headline claims. The current results reflect the integrated Mono-Hydra++ pipeline, but to directly address the source of the ATE gains, the revised manuscript will add: depth prediction error statistics (RMSE, AbsRel, etc.) on the Go-SLAM ScanNet and 7-Scenes subsets; an ablation study that removes the depth constraints from the VIO pose graph while keeping other components fixed and reports the resulting ATE; and a comparison of pose-graph residuals with versus without the predicted depth terms. These additions will clarify the role of the depth output relative to semantic masking and the deep-feature front-end.
Revision: yes
Referee: Evaluation section: The weakest assumption—that DINOv3-based depth predictions are accurate enough to act as useful constraints without introducing scale or bias errors that degrade trajectory accuracy—is not directly tested. Systematic monocular depth errors typical in indoor scenes could be down-weighted to near zero or actively harm the optimization; without reported depth metrics or controlled experiments on the ScanNet and 7-Scenes subsets, the source of the ATE improvements cannot be verified.

Authors: We acknowledge that the assumption regarding depth prediction accuracy requires direct testing. The manuscript currently emphasizes end-to-end trajectory and mapping results, but to verify that the predictions act as useful constraints without harmful bias or scale drift, we will include the depth error metrics and controlled ablation experiments described in the response to the first comment. These will be added to the evaluation section, allowing readers to assess whether the monocular depth outputs improve or degrade the optimization on the reported datasets.
Revision: yes
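The depth error statistics promised in the responses above (RMSE and AbsRel) follow standard definitions, sketched below for flat lists of predicted and ground-truth depths. Skipping pixels with non-positive ground truth is a common convention assumed here, and the sample values are hypothetical.

```python
import math


def depth_metrics(pred, gt):
    """RMSE (metres) and absolute relative error over pixels with valid ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no pixels with valid ground-truth depth")
    n = len(pairs)
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    return {"rmse": rmse, "abs_rel": abs_rel}


# Hypothetical depths in metres; the last ground-truth value is invalid (0.0).
pred = [1.1, 2.0, 3.3, 5.0]
gt = [1.0, 2.0, 3.0, 0.0]
metrics = depth_metrics(pred, gt)
print(metrics)
```

Reporting these per evaluation subset, alongside the ablated ATE, would let readers judge whether the depth predictions are accurate enough to help the pose graph.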

Circularity Check

0 steps flagged

No significant circularity; system evaluation relies on external baselines and datasets

The paper describes a composite pipeline (M2H-MX depth/semantics model + deep-feature VIO + sparse depth constraints + semantic masking + volumetric fusion) evaluated on ScanNet and 7-Scenes subsets against external RGB-D and calibrated baselines. No equations, fitted parameters, or self-citations are shown that would make the reported ATE improvements equivalent to internal definitions or prior author results by construction. The central claims remain falsifiable via the cited public datasets and competing methods.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; the central claims rest on the accuracy of a trained multi-task neural network and on the validity of using its depth outputs as pose-graph constraints, but no explicit free parameters, axioms, or invented entities are stated in the provided text.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "On the Go-SLAM ScanNet evaluation subset, Mono-Hydra++ achieves 1.6% lower average trajectory error than the strongest RGB-D baseline while using only monocular RGB plus IMU input"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 4 internal anchors

[1]

Armeni, Z.-Y

I. Armeni, Z.-Y. He, J. Gwak, A. R. Zamir, M. Fischer, J. Malik, S. Savarese, 3d scene graph: A structure for unified semantics, 3d space, and camera, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 5664–5673

work page 2019
[2]

Rosinol, M

A. Rosinol, M. Abate, Y. Chang, L. Carlone, Kimera: an open-source library for real-time metric-semantic localization and mapping, in: 2020 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2020, pp. 1689–1696

work page 2020
[3]

Hughes, Y

N. Hughes, Y. Chang, L. Carlone, Hydra: A real-time spatial perception system for 3d scene graph construction and optimization, arXiv preprint arXiv:2201.13360 (2022)

work page arXiv 2022
[4]

Rosinol, A

A. Rosinol, A. Violette, M. Abate, N. Hughes, Y. Chang, J. Shi, A. Gupta, L. Carlone, Kimera: From slam to spatial perception with 3d dynamic scene graphs, The International Journal of Robotics Research 40 (2021) 1510–1546

work page 2021
[5]

Godard, O

C. Godard, O. Mac Aodha, M. Firman, G. J. Brostow, Digging into self-supervised monocular depth estimation, in: Proceedings of the IEEE/CVFinternationalconferenceoncomputervision, 2019, pp.3828– 3838

work page 2019
[6]

Z. Huai, G. Huang, Robocentric visual–inertial odometry, The Interna- tional Journal of Robotics Research 41 (2022) 667–689

work page 2022
[7]

Y. Liu, C. Shen, C. Yu, J. Wang, Efficient video segmentation models with per-frame inference, arXiv preprint arXiv:2202.12427 (2022)

work page arXiv 2022
[8]

D. Xu, W. Ouyang, X. Wang, N. Sebe, Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 675–684

work page 2018
[9]

Vandenhende, S

S. Vandenhende, S. Georgoulis, L. Van Gool, Mti-net: Multi-scale task interaction networks for multi-task learning, in: European conference on computer vision, Springer, 2020, pp. 527–543. 44

work page 2020
[10]

H. Ye, D. Xu, Inverted pyramid multi-task transformer for dense scene understanding, in: European Conference on Computer Vision, Springer, 2022, pp. 514–530

work page 2022
[11]

X. Xu, H. Zhao, V. Vineet, S.-N. Lim, A. Torralba, Mtformer: Multi- task learning via transformer and cross-task reasoning, in: European Conference on Computer Vision, Springer, 2022, pp. 304–321

work page 2022
[12]

Udugama, G

U. Udugama, G. Vosselman, F. Nex, Mono-hydra real-time 3d scene graph construction from monocular camera input with imu, ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences 1 (2023) 439–445

work page 2023
[13]

M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception

U. Udugama, G. Vosselman, F. Nex, M2h: Multi-task learning with efficient window-based cross-task attention for monocular spatial per- ception, arXiv preprint arXiv:2510.17363 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

U. V. B. L. Udugama, G. Vosselman, F. Nex, M2h-mx: Multi-task dense visual perception for real-time monocular spatial understanding, 2026. URL:https://arxiv.org/abs/2603.29236.arXiv:2603.29236

work page arXiv 2026
[15]

A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, S. Savarese, Taskonomy: Disentangling task transfer learning, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3712–3722

work page 2018
[16]

Lopes, T.-H

I. Lopes, T.-H. Vu, R. de Charette, Densemtl: Cross-task at- tention mechanism for dense multi-task learning, arXiv preprint arXiv:2206.08927 (2022)

work page arXiv 2022
[17]

Ranftl, A

R. Ranftl, A. Bochkovskiy, V. Koltun, Vision transformers for dense prediction, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 12179–12188

work page 2021
[18]

L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, H. Zhao, Depth anything: Unleashing the power of large-scale unlabeled data, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 10371–10381. 45

work page 2024
[19]

S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, M. Müller, Zoedepth: Zero- shot transfer by combining relative and metric depth, arXiv preprint arXiv:2302.12288 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

Brüggemann, M

D. Brüggemann, M. Kanakis, A. Obukhov, S. Georgoulis, L. Van Gool, Exploring relational context for multi-task dense prediction, in: Pro- ceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 15869–15878

work page 2021
[21]

Yang, P.-T

Y. Yang, P.-T. Jiang, Q. Hou, H. Zhang, J. Chen, B. Li, Multi-task dense prediction via mixture of low-rank experts, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 27927–27937

work page 2024
[22]

B. Lin, W. Jiang, P. Chen, Y. Zhang, S. Liu, Y.-C. Chen, Mtmamba: Enhancing multi-task dense scene understanding by mamba-based de- coders, in: European Conference on Computer Vision, Springer, 2024, pp. 314–330

work page 2024
[23]

B. Lin, W. Jiang, P. Chen, S. Liu, Y.-C. Chen, Mtmamba++: Enhanc- ing multi-task dense scene understanding via mamba-based decoders, IEEETransactionsonPatternAnalysisandMachineIntelligence(2025)

work page 2025
[24]

L. Bao, B. Wu, W. Liu, Cnn in mrf: Video object segmentation via inference in a cnn-based higher-order spatio-temporal mrf, in: Proceed- ings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5977–5986

work page 2018
[25]

Tarvainen, H

A. Tarvainen, H. Valpola, Mean teachers are better role models: Weight- averaged consistency targets improve semi-supervised deep learning re- sults, Advances in neural information processing systems 30 (2017)

work page 2017
[26]

Grill, F

J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E.Buchatskaya, C.Doersch, B.AvilaPires, Z.Guo, M.GheshlaghiAzar, et al., Bootstrap your own latent-a new approach to self-supervised learning, Advances in neural information processing systems 33 (2020) 21271–21284

work page 2020
[27]

Zhang, S

Y. Zhang, S. Borse, H. Cai, F. Porikli, Auxadapt: Stable and efficient test-time adaptation for temporally consistent video semantic segmen- 46 tation, in: Proceedings of the IEEE/CVF Winter Conference on Appli- cations of Computer Vision, 2022, pp. 2339–2348

work page 2022
[28]

Campos, R

C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, J. D. Tardós, Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam, IEEE transactions on robotics 37 (2021) 1874–1890

work page 2021
[29]

T. Qin, P. Li, S. Shen, Vins-mono: A robust and versatile monocular visual-inertial state estimator, IEEE transactions on robotics 34 (2018) 1004–1020

work page 2018
[30]

Z. Huai, G. Huang, Square-root robocentric visual-inertial odometry with online spatiotemporal calibration, IEEE Robotics and Automation Letters 7 (2022) 9961–9968

work page 2022
[31]

L. Han, Y. Lin, G. Du, S. Lian, Deepvio: Self-supervised deep learning of monocular visual inertial odometry using 3d geometric constraints, in: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2019, pp. 6906–6913

work page 2019
[32]

S. Fei, J. Li, L. Li, J. Liang, J. Hu, D. Zhang, J. Han, Transformer based visual inertial odometry, in: International Conference on Guidance, Navigation and Control, Springer, 2024, pp. 567–575

work page 2024
[33]

Y. Pan, W. Zhou, Y. Cao, H. Zha, Adaptive vio: Deep visual-inertial odometry with online continual learning, in: 2024 IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2024, pp. 18019–18028

work page 2024
[34]

Maggio, H

D. Maggio, H. Lim, L. Carlone, Vggt-slam: Dense rgb slam optimized on the sl (4) manifold, Advances in Neural Information Processing Systems 39 (2025)

work page 2025
[35]

Maggio, L

D. Maggio, L. Carlone, Vggt-slam 2.0: Real-time dense feed- forward scene reconstruction, 2026. URL:https://arxiv.org/abs/ 2601.19887.arXiv:2601.19887

work page arXiv 2026
[36]

Murai, E

R. Murai, E. Dexheimer, A. J. Davison, MASt3R-SLAM: Real-time dense SLAM with 3D reconstruction priors, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 16695–16705. 47

work page 2025
[37]

Rosinol, J

A. Rosinol, J. J. Leonard, L. Carlone, Nerf-slam: Real-time dense monocular slam with neural radiance fields, in: 2023 IEEE/RSJ Inter- national Conference on Intelligent Robots and Systems (IROS), IEEE, 2023, pp. 3437–3444

work page 2023
[38]

Z. Zhu, S. Peng, V. Larsson, Z. Cui, M. R. Oswald, A. Geiger, M. Polle- feys, Nicer-slam: Neural implicit scene encoding for rgb slam, in: 2024 International Conference on 3D Vision (3DV), IEEE, 2024, pp. 42–52

work page 2024
[39]

X. Yang, H. Li, H. Zhai, Y. Ming, Y. Liu, G. Zhang, Vox-fusion: Dense tracking and mapping with voxel-based neural implicit representation, in: 2022 IEEE International Symposium on Mixed and Augmented Re- ality (ISMAR), IEEE, 2022, pp. 499–507

work page 2022
[40]

M. M. Johari, C. Carta, F. Fleuret, Eslam: Efficient dense slam system based on hybrid representation of signed distance fields, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recogni- tion, 2023, pp. 17408–17419

work page 2023
[41]

Z. Hong, B. Wang, H. Duan, Y. Huang, X. Li, Z. Wen, X. Wu, W. Xiang, Y. Zheng, Sp-slam: Neural real-time dense slam with scene priors, IEEE Transactions on Circuits and Systems for Video Technology (2025)

work page 2025
[42]

Gaussian-slam: Photo-realistic dense slam with gaussian splatting,

V. Yugay, Y. Li, T. Gevers, M. R. Oswald, Gaussian-slam: Photo- realistic dense slam with gaussian splatting, 2024. URL:https:// arxiv.org/abs/2312.10070.arXiv:2312.10070

work page arXiv 2024
[43]

Sandström, K

E. Sandström, K. Tateno, M. Oechsle, M. Niemeyer, L. Van Gool, M. R. Oswald, F. Tombari, Splat-slam: Globally optimized rgb-only slam with 3d gaussians, arXiv preprint arXiv:2405.16544 (2024)

work page arXiv 2024
[44]

Sucar, S

E. Sucar, S. Liu, J. Ortiz, A. Davison, iMAP: Implicit mapping and positioning in real-time, in: Proceedings of the International Conference on Computer Vision (ICCV), 2021, pp. 6229–6238

work page 2021
[45]

Z. Zhu, S. Peng, V. Larsson, W. Xu, H. Bao, Z. Cui, M. R. Oswald, M. Pollefeys, Nice-slam: Neural implicit scalable encoding for slam, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 12786–12796. 48

work page 2022
[46]

Z. Teed, J. Deng, DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, andRGB-DCameras, Advancesinneuralinformationprocessing systems (2021)

work page 2021
[47]

Zhang, F

Y. Zhang, F. Tosi, S. Mattoccia, M. Poggi, Go-slam: Global opti- mization for consistent 3d instant reconstruction, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 3704–3714

work page 2023
[48]

DINOv3

O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haz- iza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. Jégou, P. Labatut, P. Bojanowski, DINOv3, 2025. URL:https: //arxiv.org/abs/2508.1...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[49]

A. Gu, T. Dao, Mamba: Linear-time sequence modeling with selective state spaces, arXiv preprint arXiv:2312.00752 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

Derf: Decomposed radiance fields,

S. Farooq Bhat, I. Alhashim, P. Wonka, Adabins: Depth estimation using adaptive bins, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2021, pp. 4008–4017. URL:http://dx.doi.org/10.1109/CVPR46437.2021.00400. doi:10. 1109/CVPR46437.2021.00400

work page doi:10.1109/cvpr46437.2021.00400 2021
[51]

Kendall, Y

A. Kendall, Y. Gal, R. Cipolla, Multi-task learning using uncertainty to weigh losses for scene geometry and semantics, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7482–7491

work page 2018
[52]

DeTone, T

D. DeTone, T. Malisiewicz, A. Rabinovich, Superpoint: Self-supervised interest point detection and description, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 224–236

work page 2018
[53]

Silberman, D

N. Silberman, D. Hoiem, P. Kohli, R. Fergus, Indoor segmentation and support inference from rgbd images, in: European Conference on Computer Vision, 2012, pp. 746–760

work page 2012
[54]

Cordts, M

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benen- son, U. Franke, S. Roth, B. Schiele, The cityscapes dataset for semantic 49 urban scene understanding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223

work page 2016
[55]

A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, M. Nießner, Scannet: Richly-annotated 3d reconstructions of indoor scenes, in: Pro- ceedings of the IEEE conference on computer vision and pattern recog- nition, 2017, pp. 5828–5839

work page 2017
[56]

Shotton, B

J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, A. Fitzgibbon, Scene coordinate regression forests for camera relocalization in rgb-d images, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2930–2937

work page 2013
[57]

H. Ye, D. Xu, Taskprompter: Spatial-channel multi-task prompting for dense scene understanding, The Eleventh International Conference on Learning Representations, 2023. URL:https://openreview.net/ forum?id=-CwPopPJda

work page 2023
[58]

Taghavi, R

P. Taghavi, R. Langari, G. Pandey, Swinmtl: A shared architecture for simultaneous depth estimation and semantic segmentation from monoc- ular camera images, in: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2024, pp. 4957–4964

work page 2024
[59]

X. Wu, L. Jiang, P.-S. Wang, Z. Liu, X. Liu, Y. Qiao, W. Ouyang, T. He, H. Zhao, Point transformer v3: Simpler, faster, stronger, in: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 4840–4851

work page 2024
[60]

Knaebel, K

K. Knaebel, K. Yilmaz, D. de Geus, A. Hermans, D. Adrian, T. Lin- der, B. Leibe, Dino in the room: Leveraging 2d foundation models for 3d segmentation, International Conference on 3D Vision (3DV), 2026. arXiv:2503.18944. 50

