Segmenting Objects in Day and Night:Edge-Conditioned CNN for Thermal Image Semantic Segmentation
Pith reviewed 2026-05-24 17:04 UTC · model grok-4.3
The pith
A gated feature-wise transform layer conditions a CNN on edge priors to improve thermal image semantic segmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a novel network architecture, called edge-conditioned convolutional neural network (EC-CNN), for thermal image semantic segmentation. Particularly, we elaborately design a gated feature-wise transform layer in EC-CNN to adaptively incorporate edge prior knowledge. The whole EC-CNN is end-to-end trained, and can generate high-quality segmentation results with the edge guidance. Meanwhile, we also introduce a new benchmark dataset named Segment Objects in Day And night (SODA) for comprehensive evaluations in thermal image semantic segmentation.
What carries the argument
Gated feature-wise transform layer that adaptively incorporates edge prior knowledge into the convolutional features.
Load-bearing premise
Adaptively incorporating edge prior knowledge via the gated feature-wise transform layer will consistently improve segmentation accuracy on thermal images without introducing new failure modes or requiring scene-specific tuning.
What would settle it
Running the EC-CNN and a baseline CNN without the gated edge layer on the SODA test set and finding that the edge-conditioned version does not produce higher accuracy or introduces more errors.
Figures
read the original abstract
Despite much research progress in image semantic segmentation, it remains challenging under adverse environmental conditions caused by imaging limitations of visible spectrum. While thermal infrared cameras have several advantages over cameras for the visible spectrum, such as operating in total darkness, insensitive to illumination variations, robust to shadow effects and strong ability to penetrate haze and smog. These advantages of thermal infrared cameras make the segmentation of semantic objects in day and night. In this paper, we propose a novel network architecture, called edge-conditioned convolutional neural network (EC-CNN), for thermal image semantic segmentation. Particularly, we elaborately design a gated feature-wise transform layer in EC-CNN to adaptively incorporate edge prior knowledge. The whole EC-CNN is end-to-end trained, and can generate high-quality segmentation results with the edge guidance. Meanwhile, we also introduce a new benchmark dataset named "Segment Objects in Day And night"(SODA) for comprehensive evaluations in thermal image semantic segmentation. SODA contains over 7,168 manually annotated and synthetically generated thermal images with 20 semantic region labels and from a broad range of viewpoints and scene complexities. Extensive experiments on SODA demonstrate the effectiveness of the proposed EC-CNN against the state-of-the-art methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a novel edge-conditioned CNN (EC-CNN) architecture for semantic segmentation of thermal images captured in day and night conditions. It introduces a gated feature-wise transform layer designed to adaptively incorporate edge prior knowledge into the network. The model is trained end-to-end and evaluated on a newly introduced benchmark dataset SODA containing over 7,168 manually annotated and synthetically generated thermal images with 20 semantic labels across varied viewpoints and complexities, claiming superior performance over state-of-the-art methods.
Significance. If the central claims hold, the work contributes a new architectural component for fusing edge priors in thermal segmentation and releases a dedicated benchmark dataset for day/night thermal semantic segmentation, which addresses a gap in adverse-condition vision benchmarks and could support further research on robust segmentation under illumination extremes.
major comments (2)
- [§3] §3 (method description of gated feature-wise transform layer): the claim that this layer 'adaptively incorporate[s] edge prior knowledge' and yields 'high-quality segmentation results with the edge guidance' is load-bearing for the central contribution, yet the manuscript provides no equations defining the gate operation, no derivation showing how it avoids propagating unreliable edges (e.g., in uniform-temperature scenes), and no analysis of failure modes when edge detection quality is low.
- [§4] §4 (experiments on SODA): no ablation isolating the gated feature-wise transform layer is reported, so it is impossible to determine whether observed gains over baselines are attributable to this component or to other architectural choices; this directly undermines verification of the 'end-to-end trained' advantage asserted in the abstract.
minor comments (2)
- [Dataset description] Table 1 or dataset section: clarify the split between real and synthetically generated images in SODA and report separate metrics for each subset to allow readers to assess domain gap.
- [Qualitative results] Figure 3 or 4 (qualitative results): add failure-case examples where edge priors are noisy to illustrate the robustness of the gated layer.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving the clarity of the method and the strength of the experimental validation. We address each point below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [§3] §3 (method description of gated feature-wise transform layer): the claim that this layer 'adaptively incorporate[s] edge prior knowledge' and yields 'high-quality segmentation results with the edge guidance' is load-bearing for the central contribution, yet the manuscript provides no equations defining the gate operation, no derivation showing how it avoids propagating unreliable edges (e.g., in uniform-temperature scenes), and no analysis of failure modes when edge detection quality is low.
Authors: We agree that the manuscript lacks explicit equations and analysis for the gated feature-wise transform layer. In the revised version, we will add the mathematical formulation defining the gate operation and the adaptive incorporation of edge priors. We will also include an explanation of how the gating mechanism can suppress unreliable edges (e.g., via learned modulation in uniform-temperature scenes) along with a discussion of failure modes when edge detection quality is low. revision: yes
-
Referee: [§4] §4 (experiments on SODA): no ablation isolating the gated feature-wise transform layer is reported, so it is impossible to determine whether observed gains over baselines are attributable to this component or to other architectural choices; this directly undermines verification of the 'end-to-end trained' advantage asserted in the abstract.
Authors: We acknowledge the absence of an ablation study isolating the gated feature-wise transform layer. In the revision, we will add such an ablation, comparing the full EC-CNN against a variant without the gated layer (while keeping other elements fixed) on the SODA dataset to demonstrate the specific contribution of this component. revision: yes
Circularity Check
No circularity: empirical architecture proposal with independent evaluation
full rationale
The paper proposes EC-CNN, a new CNN architecture incorporating a gated feature-wise transform layer to fuse edge priors for thermal semantic segmentation, and introduces the SODA dataset for evaluation. The abstract and provided text contain no equations, derivations, or fitted parameters that reduce the claimed performance gains to quantities defined by the method itself. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing premises. The central claim rests on end-to-end training and experimental results on the new benchmark, which are externally falsifiable and not constructed by redefinition of inputs. This is a standard empirical contribution without circular reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- gated feature-wise transform parameters
axioms (1)
- domain assumption Edge prior knowledge can be adaptively fused into CNN features to improve thermal segmentation accuracy
Reference graph
Works this paper leans on
-
[1]
Learning collaborative sparse representation for grayscale-thermal tracking,
C. Li, H. Cheng, S. Hu, X. Liu, J. Tang, and L. Lin, “Learning collaborative sparse representation for grayscale-thermal tracking,” IEEE Transactions on Image Processing, vol. 25, no. 12, pp. 5743–5756, 2016
work page 2016
-
[2]
Cross- modal ranking with soft consistency and noisy labels for robust rgb-t tracking,
C. Li, C. Zhu, Y . Huang, J. Tang, and L. Wang, “Cross- modal ranking with soft consistency and noisy labels for robust rgb-t tracking,” in European Conference on Computer Vision, 2018
work page 2018
-
[3]
Multispectral pedestrian detection: Benchmark dataset and baseline,
S. Hwang, J. Park, N. Kim, Y . Choi, and I. S. Kweon, “Multispectral pedestrian detection: Benchmark dataset and baseline,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition , 2015
work page 2015
-
[4]
Learning cross-modal deep representations for robust pedestrian detection,
D. Xu, W. Ouyang, E. Ricci, X. Wang, and N. Sebe, “Learning cross-modal deep representations for robust pedestrian detection,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition , 2017
work page 2017
-
[5]
Rgb- infrared cross-modality person re-identification,
A. Wu, W.-S. Zheng, H. Yu, S. Gong, and J. Lai, “Rgb- infrared cross-modality person re-identification,” in Pro- ceedings of IEEE International Conference on Computer Vision, 2017
work page 2017
-
[6]
Rethinking Atrous Convolution for Semantic Image Segmentation
L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image seg- mentation,” arXiv preprint arXiv:1706.05587 , 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[7]
L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 40, no. 4, pp. 834–848, 2018
work page 2018
-
[8]
Synthetic data generation for end-to-end thermal infrared tracking,
L. Zhang, A. Gonzalez-Garcia, J. van de Weijer, M. Danelljan, and F. S. Khan, “Synthetic data generation for end-to-end thermal infrared tracking,” IEEE Transac- tions on Image Processing, vol. 28, no. 4, pp. 1837–1850, 2019
work page 2019
-
[9]
High-resolution image synthesis and semantic manipulation with conditional gans,
T.-C. Wang, M.-Y . Liu, J.-Y . Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-resolution image synthesis and semantic manipulation with conditional gans,” in Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition , 2018
work page 2018
-
[10]
Fully convolu- tional networks for semantic segmentation,
J. Long, E. Shelhamer, and T. Darrell, “Fully convolu- tional networks for semantic segmentation,” in Proceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015
work page 2015
-
[11]
SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
V . Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” arXiv preprint arXiv:1511.00561 , 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[12]
U-net: Con- volutional networks for biomedical image segmentation,
O. Ronneberger, P. Fischer, and T. Brox, “U-net: Con- volutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention , 2015
work page 2015
-
[13]
Attention to scale: Scale-aware semantic image IEEE TRANSACTIONS ON XXX 11 segmentation,
L.-C. Chen, Y . Yang, J. Wang, W. Xu, and A. L. Yuille, “Attention to scale: Scale-aware semantic image IEEE TRANSACTIONS ON XXX 11 segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , 2016
work page 2016
-
[14]
Understanding convolution for semantic segmentation,
P. Wang, P. Chen, Y . Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for semantic segmentation,” in IEEE Winter Conference on Applica- tions of Computer Vision , 2018
work page 2018
-
[15]
Efficient inference in fully connected crfs with gaussian edge potentials,
P. Krahenbuhl and V . Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” in Advances in Neural Information Processing Systems , 2011
work page 2011
-
[16]
L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille, “Semantic image segmentation with task- specific edge detection using cnns and a discriminatively trained domain transform,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016
work page 2016
-
[17]
Pushing the Boundaries of Boundary Detection using Deep Learning
I. Kokkinos, “Pushing the boundaries of bound- ary detection using deep learning,” arXiv preprint arXiv:1511.07386, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[18]
D. Cheng, G. Meng, S. Xiang, and C. Pan, “Fusionnet: Edge aware deep convolutional networks for semantic segmentation of remote sensing harbor images,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing , vol. 10, no. 12, pp. 5769–5783, 2017
work page 2017
-
[19]
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
S. Ioffe and C. Szegedy, “Batch normalization: Accelerat- ing deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167 , 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[20]
A learned representation for artistic style,
V . Dumoulin, J. Shlens, and M. Kudlur, “A learned representation for artistic style,” 2017
work page 2017
-
[21]
Arbitrary style transfer in real-time with adaptive instance normalization,
X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in Proceedings of the IEEE International Conference on Computer Vision, 2017
work page 2017
-
[22]
Modulating early visual processing by language,
H. De Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, and A. C. Courville, “Modulating early visual processing by language,” in Advances in Neural Information Processing Systems , 2017
work page 2017
-
[23]
FiLM: Visual Reasoning with a General Conditioning Layer
E. Perez, F. Strub, H. De Vries, V . Dumoulin, and A. Courville, “Film: Visual reasoning with a general conditioning layer,” arXiv preprint arXiv:1709.07871 , 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[24]
Recovering realistic texture in image super-resolution by deep spatial feature transform,
X. Wang, K. Yu, and C. D. andChen Change Loy, “Recovering realistic texture in image super-resolution by deep spatial feature transform,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015
work page 2015
-
[25]
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems , 2014
work page 2014
-
[26]
Deep generative image models using a laplacian pyramid of adversarial networks,
E. L. Denton, S. Chintala, R. Fergus et al. , “Deep generative image models using a laplacian pyramid of adversarial networks,” in Advances in Neural Information Processing Systems, 2015
work page 2015
-
[27]
Invertible Conditional GANs for image editing
G. Perarnau, J. Van De Weijer, B. Raducanu, and J. M. ´Alvarez, “Invertible conditional gans for image editing,” arXiv preprint arXiv:1611.06355 , 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[28]
Improved techniques for train- ing gans,
T. Salimans, I. Goodfellow, W. Zaremba, V . Cheung, A. Radford, and X. Chen, “Improved techniques for train- ing gans,” in Advances in Neural Information Processing Systems, 2016
work page 2016
-
[29]
Conditional Generative Adversarial Nets
M. Mirza and S. Osindero, “Conditional generative ad- versarial nets,” arXiv preprint arXiv:1411.1784 , 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[30]
Image-to- image translation with conditional adversarial networks,
P. Isola, J.-Y . Zhu, T. Zhou, and A. A. Efros, “Image-to- image translation with conditional adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , 2017
work page 2017
-
[31]
Holistically-nested edge detection,
S. Xie and Z. Tu, “Holistically-nested edge detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2015
work page 2015
-
[32]
Learning to detect natural image boundaries using local brightness, color, and texture cues,
D. R. Martin, C. C. Fowlkes, and J. Malik, “Learning to detect natural image boundaries using local brightness, color, and texture cues,” IEEE Transactions on Pattern Analysis & Machine Intelligence , vol. 26
-
[33]
Deep residual learning for image recognition,
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Ppattern Recognition, 2016
work page 2016
-
[34]
Learning Visual Reasoning Without Strong Priors
E. Perez, H. de Vries, F. Strub, V . Dumoulin, and A. Courville, “Learning visual reasoning without strong priors,” arXiv:1707.03017, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[35]
The pascal visual object classes challenge: A retrospective,
M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: A retrospective,” International journal of computer vision , vol. 111, no. 1, pp. 98–136, 2015
work page 2015
-
[36]
The role of con- text for object detection and semantic segmentation in the wild,
R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille, “The role of con- text for object detection and semantic segmentation in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , 2014
work page 2014
-
[37]
Detect what you can: Detecting and repre- senting objects using holistic models and body parts,
X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille, “Detect what you can: Detecting and repre- senting objects using holistic models and body parts,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , 2014
work page 2014
-
[38]
Microsoft coco: Common objects in context,
T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll ´ar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European confer- ence on computer vision . Springer, 2014, pp. 740–755
work page 2014
-
[39]
The cityscapes dataset for semantic urban scene under- standing,
M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. En- zweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene under- standing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , 2016
work page 2016
-
[40]
Semantic object classes in video: A high-definition ground truth database,
G. J. Brostow, J. Fauqueur, and R. Cipolla, “Semantic object classes in video: A high-definition ground truth database,” Pattern Recognition Letters , vol. 30, no. 2, pp. 88–97, 2009
work page 2009
-
[41]
Road scene segmentation from a single image,
J. M. Alvarez, T. Gevers, Y . LeCun, and A. M. Lopez, “Road scene segmentation from a single image,” in European Conference on Computer Vision , 2012
work page 2012
-
[42]
Unsupervised image trans- formation for outdoor semantic labelling,
G. Ros and J. M. Alvarez, “Unsupervised image trans- formation for outdoor semantic labelling,” in 2015 IEEE Intelligent V ehicles Symposium (IV) , 2015. IEEE TRANSACTIONS ON XXX 12
work page 2015
-
[43]
Nonparametric scene parsing: Label transfer via dense scene alignment,
C. Liu, J. Yuen, and A. Torralba, “Nonparametric scene parsing: Label transfer via dense scene alignment,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009
work page 2009
-
[44]
Indoor segmentation and support inference from rgbd images,
N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in European Conference on Computer Vision , 2012
work page 2012
-
[45]
Sun3d: A database of big spaces reconstructed using sfm and object labels,
J. Xiao, A. Owens, and A. Torralba, “Sun3d: A database of big spaces reconstructed using sfm and object labels,” in Proceedings of the IEEE International Conference on Computer Vision, 2013
work page 2013
-
[46]
Sun rgb- d: A rgb-d scene understanding benchmark suite,
S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb- d: A rgb-d scene understanding benchmark suite,” in Proceedings of the IEEE conference on computer vision and pattern recognition , 2015
work page 2015
-
[47]
Rgb-t object tracking: benchmark and baseline,
C. Li, X. Liang, Y . Lu, N. Zhao, and J. Tang, “Rgb-t object tracking: benchmark and baseline,” Pattern Recog- nition, 2019
work page 2019
-
[48]
Erfnet: Efficient residual factorized convnet for real-time semantic segmentation,
E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Ar- royo, “Erfnet: Efficient residual factorized convnet for real-time semantic segmentation,” IEEE Transactions on Intelligent Transportation Systems , vol. 19, no. 1, pp. 263–272, 2018
work page 2018
-
[49]
Large kernel matters iimprove semantic segmentation by global convolutional network,
C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large kernel matters iimprove semantic segmentation by global convolutional network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017
work page 2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.