DarkVGGT: Seeing Through Darkness Using Thermal Geometry without Daylight Tax

Chao Tian; Huiwen Han; Lulin Liu; Minseong Kweon; Nuo Chen; Srinivas Shakkottai; Wenyuan Zhao; Zhiwen Fan; Zihao Zhu

arxiv: 2606.11326 · v1 · pith:V7FFT3ARnew · submitted 2026-06-09 · 💻 cs.CV


Pith reviewed 2026-06-27 13:19 UTC · model grok-4.3


keywords RGB-T fusion · thermal geometry · low-light 3D reconstruction · feed-forward depth estimation · camera pose estimation · physics-aware thermal modeling · multi-modal geometry


The pith

DarkVGGT recovers accurate 3D scene geometry from RGB-thermal streams in darkness by separating reliable thermal shape cues from reflections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Feed-forward methods estimate 3D geometry directly from image sequences but lose reliability when visible light drops because RGB signals become too noisy for shape inference. DarkVGGT adds a thermal camera and processes the pair with two linked steps. Physics-inspired factorization splits each thermal image into an emissive part that stays consistent with object shapes and a sparse reflective remainder that can confuse geometry. A second routing step then pulls out shared structural patterns across the two modalities and feeds only the trustworthy parts back into the RGB pathway. The result is depth and pose estimates that hold up in low-visibility scenes while staying close to the original RGB-only performance when light is plentiful.
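The two steps described above can be made concrete with a toy NumPy sketch. It is not the authors' implementation: the text available here gives no equations, so the function names (`factorize_thermal`, `gated_injection`), the sigmoid emissivity head `w_eps`, the linear shared projection `w_shared`, and the `rgb_reliability` input are all assumptions chosen only to illustrate the data flow.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def factorize_thermal(thermal_tokens, w_eps):
    """Split thermal patch tokens into emissive and reflective parts.

    thermal_tokens: (P, C) patch features from one thermal frame.
    w_eps: (C,) weights of a hypothetical per-patch emissivity head.
    """
    eps_hat = sigmoid(thermal_tokens @ w_eps)       # (P,), strictly in (0, 1)
    rho_hat = 1.0 - eps_hat                         # reflective share per patch
    emissive = eps_hat[:, None] * thermal_tokens    # geometry-consistent part
    reflective = rho_hat[:, None] * thermal_tokens  # residual to be isolated
    return emissive, reflective, eps_hat

def gated_injection(rgb_tokens, emissive, w_shared, rgb_reliability):
    """Add thermal guidance to RGB tokens where RGB is unreliable.

    w_shared: (C, C) assumed linear map to a modality-shared space.
    rgb_reliability: (P,) in [0, 1]; 1 means the RGB patch is trustworthy.
    """
    shared = emissive @ w_shared                    # (P, C)
    gate = (1.0 - rgb_reliability)[:, None]         # more thermal in the dark
    return rgb_tokens + gate * shared

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P, C = 6, 4
    thermal = rng.normal(size=(P, C))
    rgb = rng.normal(size=(P, C))
    emissive, reflective, eps_hat = factorize_thermal(thermal, rng.normal(size=C))
    fused = gated_injection(rgb, emissive, rng.normal(size=(C, C)), np.full(P, 0.2))
    print(fused.shape)
```

With `rgb_reliability` set to all ones the gate closes and the RGB tokens pass through unchanged, which is one simple way to picture how well-lit performance could be preserved.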

Core claim

DarkVGGT introduces physics-inspired thermal factorization that extracts emissive-dominant, geometry-consistent thermal cues while isolating sparse reflective residuals, together with geometry-shared thermal routing that isolates modality-invariant geometric structures from thermal-specific patterns and selectively injects reliability-aware structural guidance into the RGB stream, enabling accurate thermal-informed geometry estimation under degraded RGB conditions while largely preserving performance in well-lit environments.

What carries the argument

Physics-inspired thermal factorization paired with geometry-shared thermal routing, which together supply modality-invariant geometric guidance from thermal data to an RGB feed-forward reconstruction pipeline.

If this is right

Consistent gains in depth accuracy on low-visibility RGB-T benchmarks
Improved camera-pose estimates under the same degraded conditions
Performance in well-lit scenes remains close to the RGB-only baseline
The approach works inside existing feed-forward geometry pipelines without requiring changes to the core network architecture

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same factorization idea could be tested on other modality pairs where one channel remains stable when the other degrades, such as radar or event-camera fusion.
If the routing step proves lightweight, the method might support real-time night-time mapping on mobile robots without extra daylight hardware.
The separation of emissive versus reflective thermal content might also reduce errors in applications like thermal-based material classification that currently treat the whole image as geometry.
The framework leaves open whether the same cues remain useful when thermal reflections become dense rather than sparse, a case the current benchmarks do not stress.

Load-bearing premise

Thermal images supply emissive signals that remain geometrically consistent with the scene and can be cleanly separated from reflective parts that would otherwise create ambiguity.
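For orientation, this premise resembles a standard single-band radiometric model for opaque surfaces. The block below is textbook physics under the usual assumptions (opaque surface, local thermal equilibrium, atmospheric effects ignored), not the paper's own formulation; the symbols $L_{\text{obs}}$, $B(T)$, and $L_{\text{env}}$ are introduced here for illustration only.

```latex
\begin{align}
  \varepsilon + \rho &= 1
    && \text{(Kirchhoff's law with zero transmission)} \\
  L_{\text{obs}} &= \varepsilon \, B(T) + (1 - \varepsilon) \, L_{\text{env}}
    && \text{(emitted plus reflected radiance)}
\end{align}
```

Here $\varepsilon$ is emissivity, $\rho$ is reflectivity, $B(T)$ is the blackbody radiance of the surface at temperature $T$, and $L_{\text{env}}$ is the radiance arriving from the surroundings. The first term is tied to the surface itself, while the second mirrors other objects, which is one plausible reading of why reflective content could mislead geometry estimation.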

What would settle it

A controlled ablation on low-visibility RGB-T data would settle it. The claim fails if the full model shows no gain in depth and camera-pose accuracy over a standard RGB-only feed-forward baseline, or if removing the thermal factorization and routing modules leaves accuracy unchanged.
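One way to score such a comparison is with standard geometry metrics. The sketch below uses absolute relative depth error and geodesic rotation error, both common choices but assumptions here: the text above does not say which metrics the paper reports, and the variant names passed to `compare_variants` are placeholders.

```python
import numpy as np

def abs_rel(pred_depth, gt_depth):
    """Mean absolute relative depth error over pixels with valid ground truth."""
    valid = gt_depth > 0
    diff = np.abs(pred_depth[valid] - gt_depth[valid])
    return float(np.mean(diff / gt_depth[valid]))

def rotation_error_deg(r_pred, r_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    cos = (np.trace(r_pred.T @ r_gt) - 1.0) / 2.0
    # Clip guards against values slightly outside [-1, 1] from round-off.
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def compare_variants(pred_depths, gt_depth):
    """Map each variant name to its AbsRel score against the same ground truth."""
    return {name: abs_rel(depth, gt_depth) for name, depth in pred_depths.items()}

if __name__ == "__main__":
    # Synthetic stand-in data, only to show the call pattern.
    rng = np.random.default_rng(0)
    gt = rng.uniform(1.0, 10.0, size=(8, 8))
    preds = {
        "rgb_only": gt * 1.2,
        "full_model": gt * 1.05,
    }
    print(compare_variants(preds, gt))
```

The numbers produced by the demo come from synthetic inputs and say nothing about the paper's results; the point is only that every variant is scored against the same ground truth with the same metric.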

Figures

Figures reproduced from arXiv: 2606.11326 by Chao Tian, Huiwen Han, Lulin Liu, Minseong Kweon, Nuo Chen, Srinivas Shakkottai, Wenyuan Zhao, Zhiwen Fan, Zihao Zhu.

**Figure 2.** Overview of the DarkVGGT framework. DarkVGGT factorizes thermal embeddings …

**Figure 3.** Physics-Inspired Thermal Factorization: Per-patch $\hat{\varepsilon}$ captures emissive geometry cues, while $\hat{\rho} = 1 - \hat{\varepsilon}$ isolates sparse reflective residuals. Given a sequence $\{I_s\}_{s=1}^{S}$ of RGB image frames, where $I_s \in \mathbb{R}^{3 \times H \times W}$, VGGT first patchifies each image and embeds it into a set of $P$ tokens $x_s \in \mathbb{R}^{P \times C}$ using DINOv2 [34]. Each frame is augmented with a camera token $c_s \in \mathbb{R}^{1 \times C}$ and four register tokens $r_s \in \mathbb{R}^{4 \times C}$. T…

**Figure 4.** Qualitative comparison of nighttime 3D geometry estimation across Dark3R, VGGT, SEAR, …

**Figure 5.** Reliability-gated injection samples during training in dark and light scenes.

**Figure 6.** Qualitative comparison between SEAR and our method. Blue and red cameras represent …

**Figure 7.** Preprocessed Dark3R dataset training samples.
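The per-frame token layout described in the Figure 3 caption (P patch tokens, one camera token, four register tokens, all of width C) can be sketched in NumPy as follows. This is a shape-level illustration only: the random matrix `proj` stands in for the DINOv2 embedding, and the patch size of 14 and the concatenation order are assumptions not stated in the caption.

```python
import numpy as np

def patchify(image, patch):
    """(3, H, W) -> (P, 3 * patch * patch), with P = (H // patch) * (W // patch)."""
    c, h, w = image.shape
    gh, gw = h // patch, w // patch
    x = image[:, :gh * patch, :gw * patch]      # crop to a whole number of patches
    x = x.reshape(c, gh, patch, gw, patch)
    x = x.transpose(1, 3, 0, 2, 4)              # (gh, gw, c, patch, patch)
    return x.reshape(gh * gw, c * patch * patch)

def build_frame_tokens(image, proj, camera_token, register_tokens, patch=14):
    """Return a (1 + 4 + P, C) token matrix for one frame.

    proj: (3 * patch * patch, C) stand-in for the learned patch embedding.
    camera_token: (1, C); register_tokens: (4, C).
    """
    x = patchify(image, patch) @ proj           # (P, C) patch tokens
    return np.concatenate([camera_token, register_tokens, x], axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, H, W, C, patch = 2, 28, 42, 8, 14
    frames = rng.normal(size=(S, 3, H, W))
    proj = rng.normal(size=(3 * patch * patch, C))
    cam = rng.normal(size=(1, C))
    reg = rng.normal(size=(4, C))
    tokens = np.stack([build_frame_tokens(f, proj, cam, reg, patch) for f in frames])
    print(tokens.shape)
```

Stacking the per-frame matrices gives one array of shape (S, 1 + 4 + P, C), which is the form a sequence-level transformer would consume.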

Original abstract

Recent feed-forward 3D reconstruction methods have demonstrated strong performance and flexibility in efficient end-to-end scene geometry estimation from image streams. However, their reliance on visible-light appearance makes them vulnerable in dark and low-visibility environments, where RGB cues are severely degraded and geometric evidence becomes ambiguous. To address this challenge, we propose DarkVGGT, an RGB-T feed-forward geometry framework that uses physics-aware thermal modeling for robust 3D estimation in low-light scenes. DarkVGGT introduces two complementary modules. First, physics-inspired thermal factorization extracts emissive-dominant, geometry-consistent thermal cues while isolating sparse reflective residuals that may introduce geometric ambiguity. Second, geometry-shared thermal routing isolates modality-invariant geometric structures from thermal-specific patterns, selectively injecting reliability-aware structural guidance into the RGB stream. Together, these components enable accurate thermal-informed geometry estimation under degraded RGB conditions while largely preserving performance in well-lit environments. Experiments on low-visibility RGB-T benchmarks demonstrate consistent improvements in both depth and camera pose estimation over existing feed-forward geometry baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DarkVGGT proposes thermal factorization and routing modules to improve feed-forward 3D geometry in low light, but the abstract supplies no numbers or implementation details to check the claims.


The paper's main move is to add thermal data to feed-forward 3D reconstruction so that depth and pose stay reliable when RGB cues collapse in darkness. It does this with two modules: a physics-inspired factorization that tries to separate emissive geometry signals from reflective residuals, and a routing step that pulls modality-invariant structure out of the thermal stream and feeds it selectively into the RGB path.

That direction makes sense for robotics and autonomous driving, where lighting varies and pure RGB methods are known to degrade. The stated goal of preserving well-lit performance while gaining in low-visibility scenes is a practical target, and grounding the first module in emissivity versus reflection is a reasonable starting point rather than pure learned fusion.

The soft spot is that the text gives no equations, no ablation numbers, no baseline comparisons, and no dataset specifics. The abstract asserts "consistent improvements" without showing error reductions, variance, or even which benchmarks were used, so the actual gain and whether the modules deliver what they promise cannot be checked. It is also unclear how much the factorization and routing differ from earlier thermal-RGB work; the description stays high-level.

This is for readers already working on multi-modal geometry estimation who want ideas for low-light robustness. A serious referee could usefully press for the missing quantitative evidence and implementation details. I would send it to peer review because the problem is real and the framing shows some care, even though the current write-up leaves the execution untested.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes DarkVGGT, an RGB-T feed-forward 3D geometry estimation framework for low-light scenes. It introduces two modules: (1) physics-inspired thermal factorization to extract emissive-dominant, geometry-consistent thermal cues while isolating reflective residuals, and (2) geometry-shared thermal routing to isolate modality-invariant structures and inject reliability-aware guidance into the RGB stream. The central claim is that these components enable accurate thermal-informed geometry estimation under degraded RGB conditions while largely preserving performance in well-lit environments, supported by experiments showing consistent improvements in depth and camera pose estimation over feed-forward baselines on low-visibility RGB-T benchmarks.

Significance. If the claims hold with rigorous validation, the work would address a practical limitation of current feed-forward 3D reconstruction methods by incorporating thermal data in a physics-aware manner without incurring a performance penalty in normal lighting. The emphasis on modality-invariant geometric structures and selective guidance injection could inform future multi-modal vision systems for robotics and autonomous navigation in challenging conditions.

major comments (1)

[Abstract] The claim that 'experiments on low-visibility RGB-T benchmarks demonstrate consistent improvements in both depth and camera pose estimation over existing feed-forward geometry baselines' is presented without any quantitative results, error bars, dataset specifications, baseline names, ablation studies, or implementation details. Because this claim is load-bearing for the paper's contribution, the omission leaves the central result unverifiable.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive comment on the abstract. We address the point below and outline the planned revision.


Referee: [Abstract] The claim that 'experiments on low-visibility RGB-T benchmarks demonstrate consistent improvements in both depth and camera pose estimation over existing feed-forward geometry baselines' is presented without any quantitative results, error bars, dataset specifications, baseline names, ablation studies, or implementation details. Because this claim is load-bearing for the paper's contribution, the omission leaves the central result unverifiable.

Authors: We agree that the abstract presents the central claim at a high level without the specific quantitative details, dataset names, baselines, or error metrics that would allow immediate verification. Although the full manuscript contains these elements in the Experiments section (including benchmark names, baseline comparisons, and ablation results), the referee is correct that the abstract itself does not make the claim self-contained. To resolve this, we will revise the abstract in the next version to include concise quantitative highlights (e.g., average depth error reductions and pose accuracy gains on the cited low-visibility RGB-T benchmarks relative to the named feed-forward baselines), while preserving its brevity. This change directly addresses the concern without altering the manuscript's technical content.

revision: yes

Circularity Check

0 steps flagged

No significant circularity identified


The provided abstract and description introduce two modules (physics-inspired thermal factorization and geometry-shared thermal routing) at a high level but contain no equations, derivations, fitting procedures, predictions, or self-citations that could form a load-bearing chain. No step reduces by construction to its inputs, as there are no mathematical claims or parameter fits presented. The reader's assessment of 2.0 aligns with the absence of any derivation content. The central assertions are descriptive proposals supported by (unshown) experiments rather than self-referential results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review supplies almost no technical detail; the single domain assumption below is inferred directly from the module description.

axioms (1)

domain assumption: Thermal images can be factored into emissive-dominant geometry-consistent cues and sparse reflective residuals using physics-inspired modeling.
This premise is required for the first module to isolate useful geometric information without introducing ambiguity.

pith-pipeline@v0.9.1-grok · 5732 in / 1234 out tokens · 40963 ms · 2026-06-27T13:19:52.082719+00:00 · methodology


Reference graph

Works this paper leans on

64 extracted references · 5 linked inside Pith

[1] Petr Alexa, Jaroslav Solař, Filip Čmiel, Pavel Valíček, and Miroslava Kadulová. Infrared thermographic measurement of the surface temperature and emissivity of glossy materials. Journal of Building Physics, 41(6):533–546, 2018.
[2] Eduardo Arnold, Omar Y Al-Jarrah, Mehrdad Dianati, Saber Fallah, David Oxtoby, and Alex Mouzakitis. A survey on 3d object detection methods for autonomous driving applications. IEEE Transactions on Intelligent Transportation Systems, 20(10):3782–3795, 2019.
[3] Martin Brenner, Napoleon H Reyes, Teo Susnjak, and Andre LC Barczak. RGB-D and thermal sensor fusion: A systematic literature review. IEEE Access, 11:82410–82442, 2023.
[4] Yohann Cabon, Lucas Stoffl, Leonid Antsfeld, Gabriela Csurka, Boris Chidlovskii, Jerome Revaud, and Vincent Leroy. MUSt3R: Multi-view network for stereo 3D reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1050–1060, 2025.
[5] Giovanni Maria Carlomagno and Gennaro Cardone. Infrared thermography for convective heat transfer measurements. Experiments in fluids, 49(6):1187–1218, 2010.
[6] Qian Chen, Shihao Shu, and Xiangzhi Bai. Thermal3D-GS: Physics-induced 3D gaussians for thermal infrared novel-view synthesis. In European Conference on Computer Vision, 2024.
[7] Hoonhee Cho, Jae-Young Kang, Giwon Lee, Hyemin Yang, Heejun Park, Seokwoo Jung, and Kuk-Jin Yoon. Vr-drive: Viewpoint-robust end-to-end driving with feed-forward 3d gaussian splatting. arXiv preprint arXiv:2510.23205, 2025.
[8] Bardienus Pieter Duisterhof, Lojze Zust, Philippe Weinzaepfel, Vincent Leroy, Yohann Cabon, and Jerome Revaud. MASt3R-SfM: A fully-integrated solution for unconstrained structure-from-motion. In 2025 International Conference on 3D Vision (3DV), pages 1–10. IEEE, 2025.
[9] Ahmed ElSheikh, Bassam A Abu-Nabah, Mohammad O Hamdan, and Gui-Yun Tian. Infrared camera geometric calibration: A review and a precise thermal radiation checkerboard target. Sensors, 23(7):3479, 2023.
[10] Juntong Fang, Zequn Chen, Weiqi Zhang, Donglin Di, Xuancheng Zhang, Chengmin Yang, and Yu-Shen Liu. More: Motion-aware feed-forward 4d reconstruction transformer. arXiv preprint arXiv:2603.05078, 2026.
[11] Bahareh Ghari, Ali Tourani, Asadollah Shahbahrami, and Georgi Gaydadjiev. Pedestrian detection in low-light conditions: A comprehensive survey. Image and Vision Computing, 148:105106, 2024.
[12] Andrew Y Guo, Anagh Malik, SaiKiran Tedla, Yutong Dai, Yiqian Qin, Zach Salehe, Benjamin Attal, Sotiris Nousias, Kyros Kutulakos, and David B Lindell. Dark3R: Learning structure from motion in the dark. arXiv preprint arXiv:2603.05330, 2026.
[13] Yubin Guo, Haobo Jiang, Xinlei Qi, Jin Xie, Cheng-Zhong Xu, and Hui Kong. Unsupervised visible-light images guided cross-spectrum depth estimation from dual-modality cameras. arXiv preprint arXiv:2205.00257, 2022.
[14] Jisang Han, Honggyu An, Jaewoo Jung, Takuya Narihira, Junyoung Seo, Kazumi Fukuda, Chaehyun Kim, Sunghwan Hong, Yuki Mitsufuji, and Seungryong Kim. D^2ust3r: Enhancing 3d reconstruction with 4d pointmaps for dynamic scenes. arXiv e-prints, pages arXiv–2504, 2025.
[15] Mariam Hassan, Florent Forest, Olga Fink, and Malcolm Mielle. ThermoNeRF: Joint RGB and thermal novel view synthesis for building facades using multimodal neural radiance fields. arXiv preprint arXiv:2403.12154, 2024.
[16] Yuze He, Yubin Hu, Wang Zhao, Jisheng Li, Yong-Jin Liu, Yuxing Han, and Jiangtao Wen. DarkFeat: Noise-robust feature detector and descriptor for extremely low-light RAW images. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 826–834, 2023.
[17] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
[18] Frank P Incropera, David P DeWitt, Theodore L Bergman, Adrienne S Lavine, et al. Fundamentals of heat and mass transfer, volume 6. Wiley New York, 1996.
[19] Gustav Kirchhoff. I. on the relation between the radiating and absorbing powers of different bodies for light and heat. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 20(130):1–21, 1860.
[20] Minseong Kweon, Janghyun Kim, Ukcheol Shin, and Jinsun Park. MrGS: Multi-modal radiance fields with 3D gaussian splatting for RGB-Thermal novel view synthesis. arXiv preprint arXiv:2511.22997, 2025.
[21] Byeongjun Kwon and Munchurl Kim. Multi-modal depth estimation from misaligned thermal and RGB images. In Proceedings of the Korean Institute of Broadcast and Media Engineers Summer Conference, pages 912–915, 2024.
[22] Alex Junho Lee, Younggun Cho, Young-sik Shin, Ayoung Kim, and Hyun Myung. ViViD++: Vision for visibility dataset. IEEE Robotics and Automation Letters, 7(3):6282–6289, 2022.
[23] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3D with MASt3R. In European conference on computer vision, pages 71–91. Springer, 2024.
[24] Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth Anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025.
[25] Yvette Y Lin, Xin-Yi Pan, Sara Fridovich-Keil, and Gordon Wetzstein. Thermalnerf: Thermal radiance fields. In 2024 IEEE International Conference on Computational Photography (ICCP), pages 1–12. IEEE, 2024.
[26] Ruoshi Liu and Carl Vondrick. Humans as light bulbs: 3D human reconstruction from thermal reflection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12531–12542, 2023.
[27] Hao Lu, Tianshuo Xu, Wenzhao Zheng, Yunpeng Zhang, Wei Zhan, Dalong Du, Masayoshi Tomizuka, Kurt Keutzer, and Yingcong Chen. Drivingrecon: Large 4d gaussian reconstruction model for autonomous driving. arXiv preprint arXiv:2412.09043, 2024.
[28] Rongfeng Lu, Hangyu Chen, Zunjie Zhu, Yuhang Qin, Ming Lu, Le Zhang, Chenggang Yan, and Anke Xue. ThermalGaussian: Thermal 3D gaussian splatting. arXiv preprint arXiv:2409.07200, 2024.
[29] Dominic Maggio, Hyungtae Lim, and Luca Carlone. Vggt-slam: Dense rgb slam optimized on the sl(4) manifold. arXiv preprint arXiv:2505.12549, 2025.
[30] Parv Maheshwari, Jay Karhade, Yogesh Chawla, Isaiah Adu, Florian Heisen, Andrew Porco, Andrew Jong, Yifei Liu, Santosh Pitla, Sebastian Scherer, et al. AnyThermal: Towards learning universal representations for thermal perception. arXiv preprint arXiv:2602.06203, 2026.
[31] Michael F Modest and Sandip Mazumder. Radiative heat transfer. Academic press, 2021.
[32] Riku Murai, Eric Dexheimer, and Andrew J Davison. Mast3r-slam: Real-time dense slam with 3d reconstruction priors. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16695–16705, 2025.
[33] Fred E Nicodemus. Directional reflectance and emissivity of an opaque surface. Applied optics, 4(7):767–775, 1965.
[34] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[35] Gorazd Planinsic. Infrared thermal imaging: Fundamentals, research and applications. European Journal of Physics, 32(5):1431, 2011.
[36] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021.
[37] Yinrui Ren, Jinjing Zhu, Kanghao Chen, Zhuoxiao Li, Jing Ou, Zidong Cao, Tongyan Hua, Peilun Shi, Yingchun Fu, Wufan Zhao, et al. EventVGGT: Exploring cross-modal distillation for consistent event-based depth estimation. arXiv preprint arXiv:2603.09385, 2026.
[38] Ali M Reza. Realization of the contrast limited adaptive histogram equalization (clahe) for real-time image enhancement. Journal of VLSI signal processing systems for signal, image and video technology, 38(1):35–44, 2004.
[39] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016.
[40] Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In European conference on computer vision, pages 501–518. Springer, 2016.
[41] Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3260–3269, 2017.
[42] Dharmendra Selvaratnam and Dena Bazazian. 3D reconstruction in robotics: A comprehensive review. Computers & Graphics, 130:104256, 2025.
[43] Ukcheol Shin, Kyunghyun Lee, Seokju Lee, and In So Kweon. Self-supervised depth and ego-motion estimation for monocular thermal video using multi-spectral consistency loss. IEEE Robotics and Automation Letters, 7(2):1103–1110, 2021.
[44] Ukcheol Shin, Jinsun Park, and In So Kweon. Deep depth estimation from thermal image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[45] Ukcheol Shin, Kyunghyun Lee, and Jean Oh. Bridging spectral-wise and multi-spectral depth estimation via geometry-guided contrastive learning. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 6299–6305. IEEE, 2025.
[46] Vsevolod Skorokhodov, Chenghao Xu, Shuo Sun, Olga Fink, and Malcolm Mielle. SEAR: Simple and efficient adaptation of visual geometric transformers for RGB+Thermal 3D reconstruction. arXiv preprint arXiv:2603.18774, 2026.
[47] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
[48] Bin Tan, Changjiang Sun, Xiage Qin, Hanat Adai, Zelin Fu, Tianxiang Zhou, Han Zhang, Yinghao Xu, Xing Zhu, Yujun Shen, et al. Masked depth modeling for spatial perception. arXiv preprint arXiv:2601.17895, 2026.
[49] R Usamentiaga, DF Garcia, C Ibarra-Castanedo, and X Maldague. Highly accurate geometric calibration for infrared cameras using inexpensive calibration targets. Measurement, 112:105–116, 2017.
[50] Rubén Usamentiaga, Pablo Venegas, Jon Guerediaga, Laura Vega, Julio Molleda, and Francisco G Bulnes. Infrared thermography for temperature measurement and non-destructive testing. Sensors, 14(7):12305–12348, 2014.
[51] Hengyi Wang and Lourdes Agapito. 3D reconstruction with spatial memory. In 2025 International Conference on 3D Vision (3DV), pages 78–89. IEEE, 2025.
[52] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.
[53] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024.
[54] Xiaoshan Wu, Yifei Yu, Xiaoyang Lyu, Yihua Huang, Bo Wang, Baoheng Zhang, Zhongrui Wang, and Xiaojuan Qi. EAG3R: Event-augmented 3D geometry estimation for dynamic and extreme-lighting scenes. arXiv preprint arXiv:2512.00771, 2025.
[55] Yang Wu, Zijie Lin, Yanyan Zhao, Bing Qin, and Li-Nan Zhu. A text-centered shared-private framework via cross-modal prediction for multimodal sentiment analysis. In Findings of the association for computational linguistics: ACL-IJCNLP 2021, pages 4730–4738, 2021.
[56] Jiuhong Xiao, Roshan Nayak, Ning Zhang, Daniel Tortei, and Giuseppe Loianno. ThermalGen: Style-disentangled flow-based generative models for RGB-to-Thermal image translation. arXiv preprint arXiv:2509.24878, 2025.
[57]

Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass

Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21924–21935, 2025

2025
[58]

Robo3r: Enhanc- ing robotic manipulation with accurate feed-forward 3d reconstruction.arXiv preprint arXiv:2602.10101, 2026

Sizhe Yang, Linning Xu, Hao Li, Juncheng Mu, Jia Zeng, Dahua Lin, and Jiangmiao Pang. Robo3r: Enhanc- ing robotic manipulation with accurate feed-forward 3d reconstruction.arXiv preprint arXiv:2602.10101, 2026

Pith/arXiv arXiv 2026
[59]

ScanNet++: A high-fidelity dataset of 3D indoor scenes

Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023

2023
[60]

STheReO: Stereo thermal dataset for research in odometry and mapping

Seungsang Yun, Minwoo Jung, Jeongyun Kim, Sangwoo Jung, Younghun Cho, Myung-Hwan Jeon, Giseop Kim, and Ayoung Kim. STheReO: Stereo thermal dataset for research in odometry and mapping. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3857–3864. IEEE, 2022

2022
[61]

MonST3R: A simple approach for estimating geometry in the presence of motion

Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. MonST3R: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024

Pith/arXiv arXiv 2024
[62]

Multimodal fusion on low-quality data: A comprehensive survey

Qingyang Zhang, Yake Wei, Zongbo Han, Huazhu Fu, Xi Peng, Qinghua Hu, Cheng Deng, Cai Xu, Jie Wen, Di Hu, et al. Multimodal fusion on low-quality data: A comprehensive survey. Information Fusion, page 104437, 2026

2026
[63]

FLARE: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views

Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. FLARE: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21936–21947, 2025

2025
[64]

MonoTher-Depth: Enhancing thermal depth estimation via confidence-aware distillation

Xingxing Zuo, Nikhil Ranganathan, Connor Lee, Georgia Gkioxari, and Soon-Jo Chung. MonoTher-Depth: Enhancing thermal depth estimation via confidence-aware distillation. IEEE Robotics and Automation Letters, 10(3):2830–2837, 2025.

2025

A Technical appendices and supplementary material

A.1 DarkVGGT: detailed methodology

LoRA and camera tokens. After loading the...
