Augmentation techniques for video surveillance in the visible and thermal spectral range

Ann-Kristin Grosselfinger; David Munch; Michael Arens; Vanessa Buhrmester

arxiv: 2606.13042 · v2 · pith:BNLUB7G7new · submitted 2026-06-11 · 💻 cs.AI · cs.CV

Pith reviewed 2026-06-27 06:34 UTC · model grok-4.3

classification 💻 cs.AI cs.CV

keywords: data augmentation, multispectral object detection, CNN, visible spectrum, thermal infrared, video surveillance, sensor differences, robustness

The pith

Augmentation techniques on visible images can enhance CNN performance for object detection in both visible and thermal infrared surveillance footage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper explores the use of data augmentation to address the challenges in training convolutional neural networks for multispectral object detection in video surveillance systems that combine visible and long-wave infrared cameras. Visible images provide color and texture information but are affected by illumination variations, while thermal images capture radiation but lack those details, and sufficient thermal datasets are hard to obtain. The authors examine how augmentation techniques can simulate variations in thermal radiation, shape, and color to make models trained on visible data more robust when dealing with thermal or mixed inputs. A sympathetic reader would care because this approach could allow better utilization of abundant visible data for systems that must operate day and night. The investigation provides insight into what CNNs learn from different sensor types.

Core claim

The paper claims that by applying augmentation techniques primarily to visible spectral range data, the suitability and robustness of CNNs for multispectral object detection can be improved, mitigating the effects of differences in color, texture, and thermal radiation information between the two spectral ranges.

What carries the argument

Data augmentation techniques that simulate thermal radiation effects and other variations when applied to visible images for training CNNs in object detection tasks.
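
To make the idea concrete, here is a minimal sketch of the kind of augmentation that could be applied to visible images so they look less unlike single-channel thermal frames. It is an illustration only: the abstract does not list the augmentations the authors actually use, so the specific operations (grayscale conversion, intensity inversion, brightness shift, horizontal flip), the function names, and the parameter `max_shift` are all assumptions.

```python
import numpy as np


def to_grayscale(rgb):
    """Collapse an HxWx3 uint8 RGB image to HxW luminance (BT.601 weights)."""
    weights = np.array([0.299, 0.587, 0.114])
    gray = rgb.astype(np.float64) @ weights
    return gray.round().clip(0, 255).astype(np.uint8)


def invert(gray):
    """Swap dark and bright, loosely mimicking a hot/cold polarity change."""
    return 255 - gray


def jitter_brightness(gray, rng, max_shift=40):
    """Add one random global offset (hypothetical range) and clip to uint8."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.clip(gray.astype(np.int16) + shift, 0, 255).astype(np.uint8)


def augment(rgb, rng):
    """Apply the assumed augmentation chain to one visible-spectrum image."""
    gray = to_grayscale(rgb)
    if rng.random() < 0.5:
        gray = invert(gray)
    gray = jitter_brightness(gray, rng)
    if rng.random() < 0.5:
        gray = gray[:, ::-1]  # horizontal flip
    return gray


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(8, 6, 3), dtype=np.uint8)  # stand-in image
    print(augment(image, rng).shape)
```

None of these operations models thermal emission; they only remove color and perturb intensity statistics, which is the distinction the editorial analysis further down returns to.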

If this is right

Models trained with augmented visible data show improved accuracy on thermal infrared images.
Augmentation helps address problems like varying illumination and sensor specialties.
CNNs gain better decision-making capabilities across different sensor inputs.
Training on visible data becomes more advantageous for evaluating mixed visible and infrared data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar augmentation methods might apply to other sensor modalities beyond visible and thermal.
Further research could test these techniques on real-world continuous surveillance datasets.
Combining this with actual thermal data augmentation could yield even stronger results.

Load-bearing premise

That variations in thermal radiation, shape, and color can be meaningfully simulated using standard augmentation techniques on visible data to affect classification accuracy.

What would settle it

A direct comparison experiment where a CNN trained without the proposed augmentations outperforms or matches the augmented version on thermal test data would falsify the effectiveness claim.
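
One way such a comparison could be scored is a paired bootstrap over the thermal test set, sketched below. It assumes each model's output has been reduced to a per-image correct/incorrect flag, which is a simplification for a detection task (mean average precision would be more usual); the function name `paired_bootstrap_delta` and its parameters are hypothetical, and no numbers from the paper are used.

```python
import numpy as np


def paired_bootstrap_delta(correct_base, correct_aug, n_resamples=2000, seed=0):
    """Estimate the accuracy gain of the augmented model over the baseline.

    Both inputs are 1-D sequences of 0/1 flags for the same test images, in
    the same order. Returns (observed_delta, ci_low, ci_high), where the
    interval is a 95% percentile bootstrap interval.
    """
    base = np.asarray(correct_base, dtype=float)
    aug = np.asarray(correct_aug, dtype=float)
    if base.ndim != 1 or base.shape != aug.shape:
        raise ValueError("inputs must be 1-D and of equal length")

    rng = np.random.default_rng(seed)
    n = base.size
    # Resample image indices with replacement; reuse them for both models
    # so the comparison stays paired.
    idx = rng.integers(0, n, size=(n_resamples, n))
    deltas = aug[idx].mean(axis=1) - base[idx].mean(axis=1)
    low, high = np.percentile(deltas, [2.5, 97.5])
    return float(aug.mean() - base.mean()), float(low), float(high)
```

If the resulting interval includes zero or lies below it, the augmented model has not been shown to beat the baseline on thermal data, which is the outcome that would count against the effectiveness claim.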

Figures

Figures reproduced from arXiv: 2606.13042 by Ann-Kristin Grosselfinger, David Munch, Michael Arens, Vanessa Buhrmester.

**Figure 2.** More examples for advantages and disadvantages in …

**Figure 3.** Sometimes the more visible spectrum for humans is not the one with the better detection. Red box: Detection in …

**Figure 4.** YOLOv3 Detector on MILtokyo Dataset: Confidence values in …

**Figure 5.** Images of the ThermalWorld Dataset, including humans, cars, cats, bus, humans, building, each in …

**Figure 6.** Filter kernels of the first layer, car, and building while training with the following augmentation techniques, from …

Original abstract

In intelligent video surveillance, cameras record image sequences during day and night. Commonly, this demands different sensors. To achieve a better performance it is not unusual to combine them. We focus on the case that a long-wave infrared camera records continuously and in addition to this, another camera records in the visible spectral range during daytime and an intelligent algorithm supervises the picked up imagery. More accurate, our task is multispectral CNN-based object detection. At first glance, images originating from the visible spectral range differ between thermal infrared ones in the presence of color and distinct texture information on the one hand and in not containing information about thermal radiation that emits from objects on the other hand. Although color can provide valuable information for classification tasks, effects such as varying illumination and specialties of different sensors still represent significant problems. Anyway, obtaining sufficient and practical thermal infrared datasets for training a deep neural network poses still a challenge. That is the reason why training with the help of data from the visible spectral range could be advantageous, particularly if the data, which has to be evaluated contains both visible and infrared data. However, there is no clear evidence of how strongly variations in thermal radiation, shape, or color information influence classification accuracy. To gain deeper insight into how Convolutional Neural Networks make decisions and what they learn from different sensor input data, we investigate the suitability and robustness of different augmentation techniques...

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a practical investigation into visible-to-thermal augmentation for surveillance CNNs, but standard augmentations cannot address the underlying radiation physics.

The paper examines whether common augmentation techniques applied to visible images can improve CNN robustness for multispectral object detection when thermal infrared data is also involved. It frames the work as an empirical look at data scarcity for thermal datasets and the potential to supplement with visible daytime recordings.

It does a reasonable job laying out the practical surveillance setting and noting the differences in information content between the two modalities. The authors correctly flag that there is no clear prior evidence on how thermal radiation, shape, and color variations affect accuracy, which keeps the scope honest.

The main limitation is that the approach rests on the assumption that RGB-style augmentations (color jitter, brightness, geometric transforms) can meaningfully mitigate differences driven by thermal emission. LWIR imagery encodes emitted radiance, which follows Planck's law (scaled by surface emissivity) and does not depend on reflected visible light. Those augmentations change appearance statistics but do not simulate the spectral physics, so any measured gains are likely to reflect generic robustness rather than domain-gap closure. The abstract itself undercuts stronger claims by admitting the lack of evidence on thermal influence.
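
The physical point can be illustrated with a short calculation of Planck's law. This is a sketch, not part of the paper: the function names are hypothetical, and treating a surface as a gray body with a single constant emissivity is a simplifying assumption.

```python
import math

# CODATA exact values (SI units).
H = 6.62607015e-34    # Planck constant, J*s
C = 2.99792458e8      # speed of light, m/s
K_B = 1.380649e-23    # Boltzmann constant, J/K


def planck_radiance(wavelength_m, temperature_k):
    """Blackbody spectral radiance B(lambda, T) in W * sr^-1 * m^-3."""
    prefactor = 2.0 * H * C**2 / wavelength_m**5
    exponent = H * C / (wavelength_m * K_B * temperature_k)
    return prefactor / math.expm1(exponent)


def graybody_radiance(wavelength_m, temperature_k, emissivity):
    """Scale blackbody radiance by a constant emissivity (assumed gray body)."""
    if not 0.0 <= emissivity <= 1.0:
        raise ValueError("emissivity must lie in [0, 1]")
    return emissivity * planck_radiance(wavelength_m, temperature_k)


if __name__ == "__main__":
    wavelength = 10e-6  # 10 micrometres, inside the long-wave infrared band
    for temperature in (290.0, 300.0, 310.0):
        print(temperature, planck_radiance(wavelength, temperature))
```

The only inputs are wavelength, temperature, and emissivity. Scene illumination, the quantity that brightness and color augmentations perturb, does not appear anywhere in the formula.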

This is the kind of paper that might interest engineers building multispectral surveillance systems who need to test augmentation baselines. It is not positioned as a fundamental advance. If the full manuscript contains controlled experiments with proper baselines and reports the actual performance deltas, it is worth sending to referees; otherwise the contribution stays thin. I would not cite it in my own work unless the results section shows something quantitatively surprising.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates augmentation techniques to enhance the suitability and robustness of CNNs for multispectral object detection in video surveillance, combining visible spectral range data (with color and texture) and long-wave infrared (thermal) data. It argues that training on augmented visible imagery can be advantageous when thermal datasets are limited, despite differences in information content, and seeks to clarify how variations in thermal radiation, shape, and color affect classification accuracy.

Significance. If the results establish that specific augmentations meaningfully close the domain gap and improve cross-spectral performance beyond generic regularization, the work would offer practical value for day-night surveillance systems by reducing dependence on scarce thermal training data. The emphasis on understanding CNN decision-making across sensors is a positive direction, though the physical distinction between reflected visible light and emitted thermal radiance (governed by temperature and emissivity) limits the expected transferability of standard RGB augmentations.

major comments (2)

[Abstract] The text states there is 'no clear evidence of how strongly variations in thermal radiation, shape, or color information influence classification accuracy,' yet the central investigation into augmentation techniques does not outline a concrete methodology (e.g., controlled ablations or physics-informed metrics) to isolate thermal-radiation effects from generic robustness gains; this leaves the motivation for visible-only augmentations ungrounded.
[Abstract] (weakest assumption) Standard augmentations such as color jitter, brightness, or contrast operate on RGB reflectance statistics and cannot reproduce the emitted radiance physics of LWIR imagery (Planck's law dependence on temperature and emissivity, independent of visible illumination); any reported performance improvement therefore risks being confounded with non-specific regularization rather than domain-gap closure.

minor comments (2)

[Abstract] The phrasing 'More accurate, our task is multispectral CNN-based object detection' is awkward and should be revised to 'More precisely...' for clarity.
[Abstract] The final sentence is truncated ('we investigate the suitability and robustness of different augmentation techniques...'); the full manuscript should ensure the abstract provides a complete overview of the approach and any key findings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments on the abstract below, agreeing to revisions that improve clarity on methodology and physical assumptions while defending the empirical scope of the work.

Referee: [Abstract] The text states there is 'no clear evidence of how strongly variations in thermal radiation, shape, or color information influence classification accuracy,' yet the central investigation into augmentation techniques does not outline a concrete methodology (e.g., controlled ablations or physics-informed metrics) to isolate thermal-radiation effects from generic robustness gains; this leaves the motivation for visible-only augmentations ungrounded.

Authors: The manuscript reports a series of experiments applying augmentation techniques to visible imagery and evaluating cross-spectral performance on thermal data, including comparisons across augmentation types to assess effects on detection accuracy. We agree the abstract would benefit from explicitly summarizing this design. We will revise the abstract to describe the controlled experiments and ablation-style comparisons used to investigate influences on classification accuracy.

Revision: yes

Referee: [Abstract] (weakest assumption) Standard augmentations such as color jitter, brightness, or contrast operate on RGB reflectance statistics and cannot reproduce the emitted radiance physics of LWIR imagery (Planck's law dependence on temperature and emissivity, independent of visible illumination); any reported performance improvement therefore risks being confounded with non-specific regularization rather than domain-gap closure.

Authors: We fully recognize that RGB augmentations cannot model LWIR emission physics. The study is an empirical evaluation of whether such augmentations nonetheless yield robustness benefits for multispectral detection under limited thermal data. Results show measurable improvements, interpreted as regularization aiding domain shift handling. We will revise the manuscript to explicitly discuss the physical mismatch and clarify that gains are not presented as physics-based domain closure, addressing potential confounding by providing interpretive context.

Revision: partial

Circularity Check

0 steps flagged

No derivation chain present; empirical augmentation study is self-contained

The manuscript is an empirical investigation of standard image augmentation techniques applied to visible and thermal imagery for CNN-based object detection. It contains no equations, parameter fits, predictions derived from fitted inputs, or load-bearing self-citations. Claims rest on experimental comparisons rather than any reduction of outputs to inputs by construction. The work therefore exhibits no circularity and is evaluated against external data and benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities identifiable from the abstract.

pith-pipeline@v0.9.1-grok · 5783 in / 799 out tokens · 14642 ms · 2026-06-27T06:34:22.803157+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 1 canonical work pages · 1 internal anchor

[1] Takumi, K., Watanabe, K., Ha, Q., Tejero-De-Pablos, A., Ushiku, Y., and Harada, T., “Multispectral object detection for autonomous vehicles,” in [Proceedings of the on Thematic Workshops of ACM Multimedia], (2017)

[2] Morris, N. J., Avidan, S., Matusik, W., and Pfister, H., “Statistics of infrared images,” in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition], (2007)

[3] Treible, W., Saponaro, P., Sorensen, S., Kolagunda, A., O’Neal, M., Phelan, B., Sherbondy, K., and Kambhamettu, C., “Cats: A color and thermal stereo benchmark,” in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition], (2017)

[4] Zhang, C. and Viola, P. A., “Multiple-instance pruning for learning efficient cascade detectors,” in [Advances in neural information processing systems], (2008)

[5] Socolinsky, D. A. and Selinger, A., “A comparative analysis of face recognition performance with visible and thermal infrared imagery,” in [Object recognition supported by user interaction for service robots], IEEE (2002)

[6] Kieritz, H., Hübner, W., and Arens, M., “Learning transmodal person detectors from single spectral training sets,” in [Security and Defence Conference SPIE], (2013)

[7] Sarfraz, M. S. and Stiefelhagen, R., “Deep perceptual mapping for thermal to visible face recognition,” in [Proceedings of the British Machine Vision Conference], (2015)

[8] König, D., Adam, M., Jarvers, C., Layher, G., Neumann, H., and Teutsch, M., “Fully convolutional region proposal networks for multispectral person detection,” in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops], (2017)

[9] Herrmann, C., Ruf, M., and Beyerer, J., “CNN-based thermal infrared person detection by domain adaptation,” in [Autonomous Systems: Sensors, Vehicles, Security, and the Internet of Everything], International Society for Optics and Photonics (2018)

[10] Buhrmester, V., Münch, D., Bulatov, D., and Arens, M., “Evaluating the Impact of Color Information in Deep Neural Networks,” in [Proceedings of the Iberian Conference on Pattern Recognition and Image Analysis (ibPRIA)], (2019)

[11] Kniaz, V. V., Knyaz, V. A., Hladuvka, J., Kropatsch, W. G., and Mizginov, V., “Thermalgan: Multimodal color-to-thermal image translation for person re-identification in multispectral dataset,” in [Proceedings of the European Conference on Computer Vision (ECCV)], (2018)

[12] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R., “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580 (2012)
