Supercharging Thermal Gaussian Splatting with Depth Estimation

Benjamin Busam; Chenxin Cai; Daniel Roth; Hannah Schieber; Manoj Biswanath

arxiv: 2605.30328 · v1 · pith:L7EGVTS3new · submitted 2026-05-28 · 💻 cs.CV

Manoj Biswanath, Chenxin Cai, Hannah Schieber, Daniel Roth, Benjamin Busam

Pith reviewed 2026-06-29 07:48 UTC · model grok-4.3

classification 💻 cs.CV

keywords thermal imaging · gaussian splatting · depth estimation · novel view synthesis · radiance fields · single modality · infrared · 3D reconstruction

The pith

Thermal images with estimated depth can drive Gaussian splatting for radiance fields faster than multimodal baselines while matching or exceeding their rendering quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that a single-modality pipeline using only thermal infrared images is sufficient to build high-quality 3D radiance fields via Gaussian splatting. Depth estimates derived directly from the thermal images supply the geometry needed to position the Gaussians, eliminating the need to fuse visible-light data. This matters for settings such as robotics and surveillance where thermal sensing is already present and where multimodal fusion adds latency and complexity. The reported results show that the resulting Thermal-to-Depth Gaussian Splatting (TDg) method achieves slightly better average LPIPS, SSIM, and PSNR scores than the MSMG baseline on two test collections (for LPIPS, better means lower) while cutting training time by more than half.

Core claim

The Thermal-to-Depth Gaussian Splatting (TDg) method constructs radiance fields from thermal images alone by first estimating depth maps from those images and then using the resulting 3D geometry to initialize and optimize 3D Gaussians. On the RGBT-Scenes and ThermalMix datasets, TDg records average gains of 1.12 percent in LPIPS, 0.034 percent in SSIM, and 0.01 percent in PSNR relative to the Multiple Single-Modal Gaussians baseline, accompanied by a training-time reduction of 12 minutes 47 seconds (55 percent).
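The two training-time figures quoted above (an absolute saving of 12 minutes 47 seconds and a relative saving of 55 percent) together imply absolute training times for both methods. The following is a back-of-envelope sketch, not a figure from the paper: it assumes both numbers describe the same average run and that the 55 percent is relative to the MSMG training time. Because 55 percent is a rounded value, the results are approximate.

```python
# Back-of-envelope check of the reported training-time figures.
# Inputs are the two numbers quoted in the abstract; everything else is derived.

SAVED_SECONDS = 12 * 60 + 47   # reported absolute reduction: 12 min 47 s
SAVED_FRACTION = 0.55          # reported relative reduction: 55 % (rounded)


def implied_times(saved_seconds: float, saved_fraction: float) -> tuple[float, float]:
    """Return (baseline_seconds, tdg_seconds) implied by the two reported figures."""
    baseline = saved_seconds / saved_fraction
    return baseline, baseline - saved_seconds


def fmt(seconds: float) -> str:
    """Format seconds as 'M min S s', rounded to the nearest second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes} min {secs} s"


baseline_s, tdg_s = implied_times(SAVED_SECONDS, SAVED_FRACTION)
print("implied MSMG training time:", fmt(baseline_s))
print("implied TDg training time: ", fmt(tdg_s))
```

Under these assumptions the baseline trains in roughly 23 minutes and TDg in roughly 10.5 minutes, which gives a sense of the scale at which the speedup was measured.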

What carries the argument

Thermal-to-Depth Gaussian Splatting (TDg), which substitutes depth estimation performed on thermal images for the geometry normally obtained from RGB or fused modalities.

If this is right

Thermal radiance fields can be built without any visible-light input, removing a dependency in low-light or obscured environments.
Training time reductions of roughly half make repeated reconstruction or online updates more practical for inspection and rescue tasks.
Heat-source identification in surveillance and industrial monitoring can rely on the same splatting pipeline used for geometric reconstruction.
The single-modality design lowers the computational overhead of multimodal fusion while preserving or improving perceptual image metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If thermal depth estimation proves reliable across wider temperature ranges, the same pipeline could be applied to other single-sensor modalities such as near-infrared or event cameras.
Pairing the method with lightweight real-time depth estimators could support continuous 3D mapping on mobile thermal-equipped platforms.
The observed speed advantage suggests that depth-guided single-modality splatting may scale more readily to larger scenes than fusion-based alternatives.

Load-bearing premise

Depth maps estimated from thermal images alone supply 3D geometry accurate enough to place Gaussians without introducing errors that lower final rendering quality.
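To make the premise concrete, the sketch below shows one common way a depth map becomes 3D geometry: back-projection through a pinhole camera model. This is an illustration under stated assumptions, not the paper's implementation. The pinhole model, metric depth in metres, the intrinsics `fx`, `fy`, `cx`, `cy`, and the tiny depth map are all hypothetical. It also shows why depth error matters: a relative depth error moves the point along its camera ray by the same relative amount.

```python
import math


def backproject_pixel(u, v, z, fx, fy, cx, cy):
    """Lift pixel (u, v) with depth z to a camera-frame point (X, Y, Z), pinhole model."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (list of rows) to a list of 3D points, skipping non-positive depths."""
    return [
        backproject_pixel(u, v, z, fx, fy, cx, cy)
        for v, row in enumerate(depth)
        for u, z in enumerate(row)
        if z > 0
    ]


# Hypothetical 2x3 depth map in metres and hypothetical intrinsics (not from the paper).
depth_map = [[2.0, 2.0, 0.0],
             [2.0, 4.0, 4.0]]
intrinsics = dict(fx=100.0, fy=100.0, cx=1.0, cy=0.5)
points = backproject(depth_map, **intrinsics)

# Effect of a 10 % depth overestimate at one pixel: the point slides along its ray.
true_pt = backproject_pixel(2, 1, 4.0, **intrinsics)
biased_pt = backproject_pixel(2, 1, 4.0 * 1.10, **intrinsics)
shift = math.dist(true_pt, biased_pt)
print(f"{len(points)} points; 10 % depth error shifts the point by {shift:.3f} m")
```

Any Gaussian initialized at the biased point starts about 0.4 m from where it should be in this toy case, which is the kind of placement error the premise assumes is small enough to be corrected during optimization.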

What would settle it

On a held-out scene, renderings produced by TDg using estimated thermal depth score worse on LPIPS or SSIM (higher LPIPS, lower SSIM), or show visible geometric artifacts, compared with renderings that use ground-truth depth or RGB-derived geometry.
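Because the three metrics point in different directions (higher is better for PSNR and SSIM, lower is better for LPIPS), the comparison described above is easy to get backwards. The following minimal sketch encodes the direction of each metric and lists where estimated-depth renderings fall behind a reference. The function names and all scores are hypothetical placeholders, not results from the paper.

```python
# Direction of each metric: True means a larger value is better.
HIGHER_IS_BETTER = {"PSNR": True, "SSIM": True, "LPIPS": False}


def is_worse(metric: str, candidate: float, reference: float, margin: float = 0.0) -> bool:
    """Return True if `candidate` is worse than `reference` by more than `margin`."""
    if HIGHER_IS_BETTER[metric]:
        return candidate < reference - margin
    return candidate > reference + margin


def failing_metrics(estimated: dict, reference: dict, margin: float = 0.0) -> list:
    """List the metrics on which estimated-depth renders are worse than reference renders."""
    return [m for m in estimated if is_worse(m, estimated[m], reference[m], margin)]


# Hypothetical held-out-scene scores, for illustration only (not from the paper).
estimated_depth = {"PSNR": 24.8, "SSIM": 0.88, "LPIPS": 0.21}
reference_depth = {"PSNR": 25.0, "SSIM": 0.88, "LPIPS": 0.19}
print(failing_metrics(estimated_depth, reference_depth))
```

In practice the `margin` argument would be set from measured run-to-run variation, so that differences inside the noise floor do not count as failures.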

Figures

Figures reproduced from arXiv: 2605.30328 by Benjamin Busam, Chenxin Cai, Daniel Roth, Hannah Schieber, Manoj Biswanath.

**Figure 2.** TDg architecture. A unified 3D Gaussian representation (center) is optimized via dual rasterization. By comparing the ...

**Figure 3.** Rendering image results on the dataset scenes: (a) ...

**Figure 4.** Building reconstruction: The front view was correctly ...

Original abstract

Efficient and robust 3D scene representation is crucial in autonomous driving, robotics, and related fields. While RGB images provide valuable content for 3D reconstruction, other modalities like thermal or depth can enable additional information on the environment. Lately, novel view synthesis methods like 3D Gaussian Splatting have started using multiple modalities to further boost their performance. But fusing or combining multimodal data can make the process slower and can bring in additional challenges. Therefore, our project aims to use single modality based on thermal infrared domain, by removing the reliance on visible light as much as possible. This single modality can be expected to be faster as it does not rely on multimodal data. We propose a method, Thermal-to-Depth Gaussian Splatting (TDg), that uses only thermal images and depth estimation in its architecture to derive the radiance fields. Our TDg method outperforms the MSMG (Multiple Single-Modal Gaussians) baseline in most cases on our test datasets, RGBT-Scenes and ThermalMix. On average, the rendering quality metrics such as learned perceptual image patch similarity (LPIPS), structural similarity index measure (SSIM), and peak signal-to-noise ratio (PSNR) of TDg are 1.12%, 0.034%, and 0.01% better than the baseline MSMG values. It also reduces the training time significantly, by 12 mins 47 secs (55% improvement). Overall, our method is successful in deriving these thermal radiance fields, which can ultimately have several applications, such as identifying heat sources critical in surveillance, search or rescue operations, and industrial inspections where temperature is widely used to monitor machines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Thermal Gaussian Splatting via depth estimation shows speed gains but negligible quality improvements and untested depth accuracy.

The punchline is that this paper adds a depth estimation network to Gaussian Splatting so it can work from thermal images only, and reports modest speed gains plus tiny quality improvements over one baseline.

It does show that you can drop the RGB channel and still get a working radiance field for thermal data. The training time cut of 55% is the clearest win, and they test on two datasets which is better than nothing. The method itself is a straightforward combination of existing pieces: a thermal depth predictor feeding into 3DGS.

The soft spots are bigger than the gains. The quality differences are 0.01% PSNR on average, which is not meaningful without error bars or multiple runs. The key assumption that depth from thermal is accurate enough for good Gaussian placement is not checked at all. No depth error numbers, no ablation turning the depth off, and thermal images are known to be hard for depth because they lack texture and have their own artifacts. If the depth is off by more than a small amount, the splatting will suffer and the small reported edge could disappear.

This paper is for specialists in thermal imaging for 3D or people extending novel view synthesis to non-RGB sensors. A general CV reader will not get much from it. The thinking is straightforward and they engage with the baseline they chose, but the results do not strongly support the main claim.

I would not bring this to a reading group. I would not cite it in the next year. It does not look ready for serious peer review because the evidence for the performance claim is too weak.

Referee Report

3 major / 1 minor

Summary. The paper proposes Thermal-to-Depth Gaussian Splatting (TDg), a single-modality approach that performs 3D Gaussian Splatting using only thermal images together with monocular depth estimates derived from those images. It reports that TDg outperforms the MSMG baseline on the RGBT-Scenes and ThermalMix datasets, with average gains of 1.12% in LPIPS, 0.034% in SSIM and 0.01% in PSNR, while also reducing training time by 12 min 47 s (55%). The work positions the method for applications in surveillance, search-and-rescue and industrial inspection where thermal data are primary.

Significance. If the performance claims and depth-accuracy assumption hold, the work would demonstrate that thermal-only radiance fields can be obtained more efficiently than multimodal baselines, with the reported training-time reduction being the most practically relevant outcome. The metric gains are too small to constitute a clear advance in rendering quality.
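To see how small the reported gains are in absolute terms, the percentages can be converted into metric units. The sketch below assumes the percentages are relative to the baseline's metric value, which the abstract suggests but does not state outright. The baseline values are hypothetical round numbers chosen only to show the scale; they are not taken from the paper.

```python
# Reported average relative gains from the abstract, in percent.
REPORTED_GAIN_PERCENT = {"LPIPS": 1.12, "SSIM": 0.034, "PSNR": 0.01}

# Hypothetical baseline metric values, for scale only (not from the paper).
ASSUMED_BASELINE = {"LPIPS": 0.20, "SSIM": 0.90, "PSNR": 25.0}


def absolute_delta(baseline: float, gain_percent: float) -> float:
    """Absolute change implied by a relative gain, assuming it is relative to `baseline`."""
    return baseline * gain_percent / 100.0


for name, gain in REPORTED_GAIN_PERCENT.items():
    delta = absolute_delta(ASSUMED_BASELINE[name], gain)
    print(f"{name}: {gain} % of {ASSUMED_BASELINE[name]} = {delta:.5f}")
```

With a baseline of 25 dB, for example, a 0.01 percent PSNR gain corresponds to 0.0025 dB.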

major comments (3)

[Abstract] The reported improvements (1.12% LPIPS, 0.034% SSIM, 0.01% PSNR) are presented without error bars, standard deviations across scenes, or any statistical significance test; given that the PSNR delta is smaller than typical run-to-run variation in Gaussian Splatting, it is impossible to judge whether the central performance claim is supported.
[Method / Experiments] The central assumption that monocular depth estimation on thermal images alone supplies geometry accurate enough to avoid Gaussian placement errors is never quantified; no depth-prediction metrics (RMSE, AbsRel, etc.), no ablation that removes or replaces the depth branch, and no comparison against RGB-derived depth are provided, leaving the load-bearing premise untested.
[Experiments] The only baseline is MSMG; without additional controls (e.g., thermal images with ground-truth depth, or depth estimators fine-tuned on thermal data), it is unclear whether any observed difference is attributable to the proposed depth integration or to other implementation choices.
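The first comment asks for a measure of spread alongside the average deltas. A minimal sketch of what that could look like is given here: a paired per-scene comparison that reports the mean difference, its standard deviation, and the standard error of the mean. The per-scene PSNR values are invented placeholders, not data from the paper, and `paired_summary` is a hypothetical helper name.

```python
import statistics


def paired_summary(scores_a, scores_b):
    """Summarise per-scene differences (a - b) between two methods on the same scenes."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least two scenes")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)       # sample standard deviation
    sem = std_diff / len(diffs) ** 0.5       # standard error of the mean difference
    return {"mean_diff": mean_diff, "std_diff": std_diff, "sem": sem}


# Placeholder per-scene PSNR values in dB, invented for illustration (not from the paper).
tdg_psnr = [25.0, 26.0, 24.0, 27.0]
msmg_psnr = [24.9, 26.2, 24.1, 26.9]

summary = paired_summary(tdg_psnr, msmg_psnr)
print(summary)
# A mean difference that is small relative to its standard error is not evidence of a gain.
print("difference exceeds 2 x SEM:", abs(summary["mean_diff"]) > 2 * summary["sem"])
```

Pairing by scene removes scene-to-scene difficulty from the comparison; repeating each run with several random seeds would additionally expose the run-to-run variation the referee mentions.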

minor comments (1)

[Abstract] The parenthetical expansion of LPIPS is given, but the acronym is not introduced on first use in the main text; consistent notation should be checked throughout.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, with planned revisions where appropriate to strengthen the presentation of results and acknowledge limitations.

Referee: [Abstract] The reported improvements (1.12% LPIPS, 0.034% SSIM, 0.01% PSNR) are presented without error bars, standard deviations across scenes, or any statistical significance test; given that the PSNR delta is smaller than typical run-to-run variation in Gaussian Splatting, it is impossible to judge whether the central performance claim is supported.

Authors: We agree that the absence of error bars and standard deviations limits interpretability, particularly given the small PSNR delta. In the revised manuscript we will report per-metric standard deviations computed across scenes on both datasets. We also note that the primary practical contribution is the 55% training-time reduction while achieving rendering quality that is at least comparable to the baseline.
revision: yes

Referee: [Method / Experiments] The central assumption that monocular depth estimation on thermal images alone supplies geometry accurate enough to avoid Gaussian placement errors is never quantified; no depth-prediction metrics (RMSE, AbsRel, etc.), no ablation that removes or replaces the depth branch, and no comparison against RGB-derived depth are provided, leaving the load-bearing premise untested.

Authors: We acknowledge that direct validation of the depth estimates (metrics, ablations, or RGB comparison) is absent. The current evaluation centers on final rendering quality rather than intermediate depth accuracy. In the revision we will add an explicit limitations paragraph discussing this assumption and its potential effect on Gaussian placement.
revision: partial

Referee: [Experiments] The only baseline is MSMG; without additional controls (e.g., thermal images with ground-truth depth, or depth estimators fine-tuned on thermal data), it is unclear whether any observed difference is attributable to the proposed depth integration or to other implementation choices.

Authors: MSMG (Multiple Single-Modal Gaussians) was selected because it is the closest published multimodal baseline. Controls involving ground-truth depth or thermal-specific fine-tuning would require new data collection and training runs outside the scope of the present study. The reported efficiency gain remains directly attributable to the single-modality design.
revision: no

Circularity Check

0 steps flagged

No significant circularity detected in claimed results.

The paper presents TDg as an empirical method that applies monocular depth estimation to thermal images and compares rendering metrics (LPIPS, SSIM, PSNR) plus training time against the named MSMG baseline on RGBT-Scenes and ThermalMix. No equations, fitted parameters, or self-citations are shown that would make the reported metric deltas or speedups equivalent to the inputs by construction. The central claim remains an external empirical comparison whose validity depends on the (untested here) accuracy of the depth branch rather than any definitional or self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unverified assumption that thermal depth estimation is accurate enough for 3D reconstruction; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: Depth estimation from thermal images alone yields geometry accurate enough for radiance field reconstruction.
The method explicitly removes reliance on visible light and substitutes estimated depth to derive the fields.
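The axiom above could be checked directly wherever reference depth is available. The sketch below computes the two depth-error metrics named in the referee report, AbsRel (mean absolute relative error) and RMSE, over valid pixels. It is a generic illustration, not the paper's evaluation code. The flat-list input format, the function name `depth_errors`, and the sample values are hypothetical.

```python
import math


def depth_errors(pred, gt):
    """Return (abs_rel, rmse) over pixels whose ground-truth depth is positive.

    `pred` and `gt` are equally long flat sequences of depths in the same unit.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / len(pairs)
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))
    return abs_rel, rmse


# Hypothetical depths in metres, for illustration only (not from the paper).
predicted = [1.1, 2.0, 4.0, 3.0]
reference = [1.0, 2.0, 5.0, 0.0]   # 0.0 marks a pixel without valid ground truth

abs_rel, rmse = depth_errors(predicted, reference)
print(f"AbsRel = {abs_rel:.3f}, RMSE = {rmse:.3f} m")
```

Note that monocular estimators often predict depth only up to scale, in which case predictions are usually aligned to the reference (for example by a median ratio) before these metrics are computed; that step is omitted here.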
