Boundary-Aware Uncertainty Quantification for Wildfire Spread Prediction
Pith reviewed 2026-05-08 18:26 UTC · model grok-4.3
The pith
A Fire-Centered Evaluation Region framework shows that a distilled student model achieves comparable uncertainty calibration to ensembles in wildfire boundary zones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By defining the Fire-Centered Evaluation Region framework as a way to focus uncertainty evaluation on critical fire zones, the comparison on the wildfire dataset reveals that the distilled student model delivers calibration levels comparable to those of the ensemble while providing complementary uncertainty information in boundary-relevant regimes.
What carries the argument
The Fire-Centered Evaluation Region (FCER) framework, which conditions uncertainty quantification evaluation on spatially relevant fire-centered areas to prioritize operational relevance over global statistics.
Load-bearing premise
That focusing evaluation on fire-centered regions yields a more operationally relevant measure of uncertainty quantification without introducing new biases or needing separate validation.
What would settle it
An experiment where models ranked highly by FCER perform worse in actual wildfire boundary prediction accuracy compared to those favored by global metrics, or where the student model's calibration advantage disappears on independent test fires.
Figures
read the original abstract
Reliable wildfire spread prediction is vital for risk-aware emergency planning, yet most deep learning models lack principled uncertainty quantification (UQ). Further, for boundary-sensitive cases like wildfire spread, evaluating models with global metrics alone is often insufficient. To shift the focus of UQ evaluation toward a more operationally relevant approach, the Fire-Centered Evaluation Region (FCER) framework is introduced as a spatially conditioned protocol to characterize UQ within critical fire zones. Using FCER, an Ensemble is compared against an distilled single-pass student model on the WildfireSpreadTS dataset. The student model demonstrates comparable calibration and complementary uncertainty ranking in boundary-relevant regimes. Code is available at https://github.com/jonasvilhofunk/WildfireUQ-FCER
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce the Fire-Centered Evaluation Region (FCER) framework as a spatially conditioned protocol for more operationally relevant uncertainty quantification (UQ) evaluation in wildfire spread prediction, focusing on critical fire zones. Using this, it compares an ensemble model to a distilled single-pass student model on the WildfireSpreadTS dataset, reporting that the student achieves comparable calibration and complementary uncertainty ranking in boundary-relevant regimes. Open code is provided.
Significance. The work addresses an important gap in UQ for boundary-sensitive predictions in wildfire modeling, which is crucial for emergency planning. The FCER idea and the model distillation for efficient UQ have potential significance if properly validated. Credit is given for providing code. However, the absence of methodological details and quantitative results reduces the current impact.
major comments (3)
- The abstract and manuscript introduce FCER as a 'spatially conditioned protocol' but supply no construction details such as region radius, fire-mask threshold, or boundary extraction method, nor any ablation on these choices. This is load-bearing for the claim that FCER provides a less biased, more relevant characterization than global metrics.
- The comparison of ensemble vs. student model under FCER lacks any quantitative metrics, error bars, statistical tests, or dataset details to support the 'comparable calibration' and 'complementary uncertainty ranking' claims. This makes the central empirical result unverifiable.
- No head-to-head comparison is shown demonstrating that FCER produces rankings or calibration scores that diverge from global ECE/NLL in operationally meaningful ways for wildfire spread. The operational relevance is asserted without evidence.
minor comments (2)
- Typo in abstract: 'an distilled' should be 'a distilled'.
- The notation and exact definition of FCER could be clarified with a formal equation or pseudocode for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for recognizing the potential significance of the FCER framework for operationally relevant UQ in wildfire spread prediction. We address each major comment below and will incorporate the suggested additions to improve clarity, verifiability, and evidence for operational relevance.
read point-by-point responses
-
Referee: The abstract and manuscript introduce FCER as a 'spatially conditioned protocol' but supply no construction details such as region radius, fire-mask threshold, or boundary extraction method, nor any ablation on these choices. This is load-bearing for the claim that FCER provides a less biased, more relevant characterization than global metrics.
Authors: We agree that the construction details of FCER are critical for reproducibility and to support claims of reduced bias relative to global metrics. In the revised manuscript we will expand the Methods section with explicit specifications: region radius of 5 pixels centered on fire pixels, fire-mask threshold of 0.5 for binary segmentation, and boundary extraction via Canny edge detection on the fire mask followed by dilation. We will also add an ablation study varying radius (3–7 pixels) and threshold (0.4–0.6), reporting resulting changes in ECE and uncertainty ranking to demonstrate robustness of the operational relevance claim. revision: yes
-
Referee: The comparison of ensemble vs. student model under FCER lacks any quantitative metrics, error bars, statistical tests, or dataset details to support the 'comparable calibration' and 'complementary uncertainty ranking' claims. This makes the central empirical result unverifiable.
Authors: We acknowledge that the current presentation of results is primarily qualitative. The revised version will report quantitative ECE and NLL values under FCER for both models, with error bars from five independent training runs. We will include paired t-test p-values to assess statistical significance of differences and expand the dataset description to specify the WildfireSpreadTS train/test split (70/30), total frames (approximately 12,000), and preprocessing steps. These additions will render the comparability and ranking claims directly verifiable. revision: yes
-
Referee: No head-to-head comparison is shown demonstrating that FCER produces rankings or calibration scores that diverge from global ECE/NLL in operationally meaningful ways for wildfire spread. The operational relevance is asserted without evidence.
Authors: We will add a dedicated subsection and figure that directly contrasts FCER-derived rankings and calibration scores against global ECE/NLL. The new analysis will highlight specific wildfire instances where global metrics yield similar model rankings but FCER identifies substantially higher boundary uncertainty in the fire-front region, which is directly relevant to emergency evacuation planning. This will supply concrete evidence that the divergence has operational implications beyond what global metrics capture. revision: yes
Circularity Check
No circularity: empirical comparison on external dataset with no derivations or self-referential reductions.
full rationale
The manuscript introduces the FCER protocol as a new spatially conditioned evaluation method and reports an empirical head-to-head comparison of an ensemble versus a distilled student model on the WildfireSpreadTS dataset. No equations, derivations, fitted parameters, or uniqueness theorems appear in the provided text. The central claim rests on observed calibration and uncertainty-ranking behavior under FCER rather than any reduction of outputs to inputs by construction. Self-citation is absent from the load-bearing steps, and the protocol definition does not tautologically presuppose its own superiority.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Ensemble and knowledge-distillation methods produce well-calibrated uncertainty estimates for spatiotemporal wildfire data.
invented entities (1)
-
Fire-Centered Evaluation Region (FCER)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Nerilie J Abram, Benjamin J Henley, Alex Sen Gupta, Tanya JR Lippmann, Hamish Clarke, Andrew J Dowdy, Jason J Sharples, Rachael H Nolan, Tianran Zhang, Martin J Wooster, et al. Connections of climate change and variability to large and extreme forest fires in southeast australia.Communications Earth & Environment, 2(1), 2021
work page 2021
-
[2]
Wildfires on a changing planet.Nature Communications, 17(1), 2026
Olivia Haas, Iain Colin Prentice, and Sandy P Harrison. Wildfires on a changing planet.Nature Communications, 17(1), 2026
work page 2026
-
[3]
Machine learning and deep learning for wildfire spread prediction: A review.Fire, 7(12), 2024
Henintsoa S Andrianarivony and Moulay A Akhloufi. Machine learning and deep learning for wildfire spread prediction: A review.Fire, 7(12), 2024
work page 2024
-
[4]
WildfireSpreadTS: A dataset of multi- modal time series for wildfire spread prediction
Sebastian Gerard, Yu Zhao, and Josephine Sullivan. WildfireSpreadTS: A dataset of multi- modal time series for wildfire spread prediction. InAdvances in Neural Information Processing Systems (NeurIPS), 2023
work page 2023
-
[5]
Spatial uncertainty quantification in wildfire forecasting for climate- resilient emergency planning
Aditya Chakravarty. Spatial uncertainty quantification in wildfire forecasting for climate- resilient emergency planning. InNeurIPS Workshop on Tackling Climate Change with Machine Learning, 2025
work page 2025
-
[6]
Spatially-aware evaluation of segmentation uncertainty
Tal Zeevi, Eléonore V Lieffrig, Lawrence H Staib, and John A Onofrey. Spatially-aware evaluation of segmentation uncertainty. InCVPR 4th Workshop on Uncertainty Quantification for Computer Vision, 2025
work page 2025
-
[7]
Steven Landgraf, Kira Wursthorn, Markus Hillemann, and Markus Ulrich. Dudes: Deep uncertainty distillation using ensembles for semantic segmentation.PFG – Journal of Pho- togrammetry, Remote Sensing and Geoinformation Science, 92(2), 2024
work page 2024
-
[8]
Improved wildfire spread prediction with time-series data and the WSTS+ benchmark
Saad Lahrichi, Jake Bova, Jesse Johnson, and Jordan Malof. Improved wildfire spread prediction with time-series data and the WSTS+ benchmark. InWinter Conference on Applications of Computer Vision (WACV), 2026
work page 2026
-
[9]
Panoptic Segmentation of Satellite Image Time Series with Convolutional Temporal Attention Networks
Vivien Sainte Fare Garnot and Loic Landrieu. Panoptic Segmentation of Satellite Image Time Series with Convolutional Temporal Attention Networks. InInternational Conference on Computer Vision (ICCV), 2021
work page 2021
-
[10]
Individual comparisons by ranking methods.Biometrics bulletin, 1(6), 1945
Frank Wilcoxon. Individual comparisons by ranking methods.Biometrics bulletin, 1(6), 1945. 5 A Supplementary Material This appendix is organized as follows. Section A.1 defines all metrics used in the paper. Section A.2 provides additional FCER sweep plots omitted from the main paper for space, including per-year AUROC and AUPRC curves as well as mean cal...
work page 1945
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.