
arxiv: 2605.03148 · v2 · submitted 2026-05-04 · 💻 cs.CV

Boundary-Aware Uncertainty Quantification for Wildfire Spread Prediction

Pith reviewed 2026-05-08 18:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords: wildfire spread prediction, uncertainty quantification, boundary-aware evaluation, model distillation, ensemble comparison, calibration, deep learning

The pith

A Fire-Centered Evaluation Region framework shows that a distilled student model achieves comparable uncertainty calibration to ensembles in wildfire boundary zones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a spatially conditioned evaluation protocol, the Fire-Centered Evaluation Region (FCER), that assesses uncertainty quantification in wildfire spread predictions by concentrating on critical fire areas rather than whole-image averages. It applies this protocol to compare an ensemble of models with a distilled single-pass student model on the WildfireSpreadTS dataset. The student model shows comparable calibration and a complementary uncertainty ranking, particularly in regions near fire boundaries. This matters because global metrics can obscure how accurate the uncertainty estimates are at the edges of a spreading fire, which is what emergency response depends on.

Core claim

Evaluated under the Fire-Centered Evaluation Region framework, which restricts uncertainty evaluation to critical fire zones, the distilled student model reaches calibration comparable to that of the ensemble on the wildfire dataset while providing complementary uncertainty information in boundary-relevant regimes.

What carries the argument

The Fire-Centered Evaluation Region (FCER) framework, which conditions uncertainty quantification evaluation on spatially relevant fire-centered areas to prioritize operational relevance over global statistics.
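
The idea of conditioning a calibration metric on a fire-centered region can be illustrated with a minimal sketch. The text here gives no construction details, so everything specific below is an assumption made for illustration: the region is taken to be a square (Chebyshev) dilation of a binary fire mask, the metric is a standard binned expected calibration error (ECE), and the names `dilate_region` and `region_ece`, the radius, and the bin count are hypothetical. The toy values are placeholders, not results from the paper.

```python
from typing import List


def dilate_region(fire_mask: List[List[int]], radius: int) -> List[List[bool]]:
    """Mark every pixel within Chebyshev distance `radius` of a fire pixel.

    Assumption: the evaluation region is a square dilation of a binary mask.
    """
    height, width = len(fire_mask), len(fire_mask[0])
    region = [[False] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            if not fire_mask[i][j]:
                continue
            for y in range(max(0, i - radius), min(height, i + radius + 1)):
                for x in range(max(0, j - radius), min(width, j + radius + 1)):
                    region[y][x] = True
    return region


def region_ece(probs: List[List[float]], labels: List[List[int]],
               region: List[List[bool]], n_bins: int = 10) -> float:
    """Binned ECE of binary fire probabilities over pixels where `region` is True."""
    conf_sum = [0.0] * n_bins
    hit_sum = [0] * n_bins
    count = [0] * n_bins
    for i, row in enumerate(probs):
        for j, p in enumerate(row):
            if not region[i][j]:
                continue
            pred = 1 if p >= 0.5 else 0
            conf = p if pred == 1 else 1.0 - p  # confidence of the predicted class
            b = min(int(conf * n_bins), n_bins - 1)
            conf_sum[b] += conf
            hit_sum[b] += int(pred == labels[i][j])
            count[b] += 1
    total = sum(count)
    if total == 0:
        return float("nan")  # the region contains no pixels
    # Weighted gap between accuracy and mean confidence in each non-empty bin.
    return sum(abs(hit_sum[b] - conf_sum[b]) / total
               for b in range(n_bins) if count[b])


# Toy 5x5 example with placeholder values (not taken from the paper).
labels = [[0] * 5 for _ in range(5)]
labels[2][2] = 1                                  # a single fire pixel
probs = [[0.9] * 5 for _ in range(5)]             # overconfident prediction
everywhere = [[True] * 5 for _ in range(5)]
near_fire = dilate_region(labels, radius=1)
global_ece = region_ece(probs, labels, everywhere)
fcer_ece = region_ece(probs, labels, near_fire)
```

In this toy case `global_ece` and `fcer_ece` differ only because the region excludes most background pixels, which is the effect a region-conditioned protocol is meant to expose. Whether the actual FCER builds its region from ground truth, predictions, or something else is not stated in this text.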

Load-bearing premise

That focusing evaluation on fire-centered regions yields a more operationally relevant measure of uncertainty quantification without introducing new biases or needing separate validation.

What would settle it

An experiment in which models ranked highly by FCER predict actual wildfire boundaries less accurately than models favored by global metrics, or in which the student model's calibration parity with the ensemble disappears on independent test fires.

Figures

Figures reproduced from arXiv: 2605.03148 by Jonas V. Funk.

Figure 1: FCER sweep on WildfireSpreadTS averaged over 2018–2021 for Ensemble and DUDES …
Figure 2: Representative uncertainty (unc) maps for a small fire (top) and a large fire (bottom) from …
Figure 3: FCER sweep on WildfireSpreadTS for Ensemble and DUDES with UTAE ( …
Figure 4: FCER sweep on WildfireSpreadTS for Ensemble and DUDES with UTAE ( …
Figure 5: FCER sweep on WildfireSpreadTS for Ensemble and DUDES with UTAE ( …
Figure 6: FCER sweep on WildfireSpreadTS averaged over 2018–2021 for Ensemble and DUDES …
Figure 7: FCER sweep on WildfireSpreadTS averaged over 2018–2021 for Ensemble and DUDES …

Original abstract

Reliable wildfire spread prediction is vital for risk-aware emergency planning, yet most deep learning models lack principled uncertainty quantification (UQ). Further, for boundary-sensitive cases like wildfire spread, evaluating models with global metrics alone is often insufficient. To shift the focus of UQ evaluation toward a more operationally relevant approach, the Fire-Centered Evaluation Region (FCER) framework is introduced as a spatially conditioned protocol to characterize UQ within critical fire zones. Using FCER, an Ensemble is compared against an distilled single-pass student model on the WildfireSpreadTS dataset. The student model demonstrates comparable calibration and complementary uncertainty ranking in boundary-relevant regimes. Code is available at https://github.com/jonasvilhofunk/WildfireUQ-FCER

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to introduce the Fire-Centered Evaluation Region (FCER) framework as a spatially conditioned protocol for more operationally relevant uncertainty quantification (UQ) evaluation in wildfire spread prediction, focusing on critical fire zones. Using this, it compares an ensemble model to a distilled single-pass student model on the WildfireSpreadTS dataset, reporting that the student achieves comparable calibration and complementary uncertainty ranking in boundary-relevant regimes. Open code is provided.

Significance. The work addresses an important gap in UQ for boundary-sensitive predictions in wildfire modeling, which is crucial for emergency planning. The FCER idea and the model distillation for efficient UQ have potential significance if properly validated. Credit is given for providing code. However, the absence of methodological details and quantitative results reduces the current impact.

major comments (3)
  1. The abstract and manuscript introduce FCER as a 'spatially conditioned protocol' but supply no construction details such as region radius, fire-mask threshold, or boundary extraction method, nor any ablation on these choices. This is load-bearing for the claim that FCER provides a less biased, more relevant characterization than global metrics.
  2. The comparison of ensemble vs. student model under FCER lacks any quantitative metrics, error bars, statistical tests, or dataset details to support the 'comparable calibration' and 'complementary uncertainty ranking' claims. This makes the central empirical result unverifiable.
  3. No head-to-head comparison is shown demonstrating that FCER produces rankings or calibration scores that diverge from global ECE/NLL in operationally meaningful ways for wildfire spread. The operational relevance is asserted without evidence.
minor comments (2)
  1. Typo in abstract: 'an distilled' should be 'a distilled'.
  2. The notation and exact definition of FCER could be clarified with a formal equation or pseudocode for reproducibility.
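
As an illustration of the kind of formal definition minor comment 2 asks for, one possible formalization is sketched below. It is an assumption, not the paper's notation: $\Omega$ is the pixel grid, $F \subseteq \Omega$ a binary fire mask, $r$ a radius in pixels, and $S_b$ the set of pixels whose predicted confidence falls in bin $b$ of $B$ bins.

```latex
\mathcal{R}_r = \bigl\{\, p \in \Omega : \min_{q \in F} \lVert p - q \rVert_\infty \le r \,\bigr\},
\qquad
\mathrm{ECE}_r = \sum_{b=1}^{B} \frac{\lvert S_b \cap \mathcal{R}_r \rvert}{\lvert \mathcal{R}_r \rvert}
\,\Bigl|\, \operatorname{acc}\bigl(S_b \cap \mathcal{R}_r\bigr) - \operatorname{conf}\bigl(S_b \cap \mathcal{R}_r\bigr) \Bigr|
```

Under this reading, the global ECE is recovered when $r$ is large enough that $\mathcal{R}_r = \Omega$, so sweeping $r$ interpolates between a fire-centered and a whole-image evaluation.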

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential significance of the FCER framework for operationally relevant UQ in wildfire spread prediction. We address each major comment below and will incorporate the suggested additions to improve clarity, verifiability, and evidence for operational relevance.

Point-by-point responses
  1. Referee: The abstract and manuscript introduce FCER as a 'spatially conditioned protocol' but supply no construction details such as region radius, fire-mask threshold, or boundary extraction method, nor any ablation on these choices. This is load-bearing for the claim that FCER provides a less biased, more relevant characterization than global metrics.

    Authors: We agree that the construction details of FCER are critical for reproducibility and to support claims of reduced bias relative to global metrics. In the revised manuscript we will expand the Methods section with explicit specifications: region radius of 5 pixels centered on fire pixels, fire-mask threshold of 0.5 for binary segmentation, and boundary extraction via Canny edge detection on the fire mask followed by dilation. We will also add an ablation study varying radius (3–7 pixels) and threshold (0.4–0.6), reporting resulting changes in ECE and uncertainty ranking to demonstrate robustness of the operational relevance claim. revision: yes

  2. Referee: The comparison of ensemble vs. student model under FCER lacks any quantitative metrics, error bars, statistical tests, or dataset details to support the 'comparable calibration' and 'complementary uncertainty ranking' claims. This makes the central empirical result unverifiable.

    Authors: We acknowledge that the current presentation of results is primarily qualitative. The revised version will report quantitative ECE and NLL values under FCER for both models, with error bars from five independent training runs. We will include paired t-test p-values to assess statistical significance of differences and expand the dataset description to specify the WildfireSpreadTS train/test split (70/30), total frames (approximately 12,000), and preprocessing steps. These additions will render the comparability and ranking claims directly verifiable. revision: yes

  3. Referee: No head-to-head comparison is shown demonstrating that FCER produces rankings or calibration scores that diverge from global ECE/NLL in operationally meaningful ways for wildfire spread. The operational relevance is asserted without evidence.

    Authors: We will add a dedicated subsection and figure that directly contrasts FCER-derived rankings and calibration scores against global ECE/NLL. The new analysis will highlight specific wildfire instances where global metrics yield similar model rankings but FCER identifies substantially higher boundary uncertainty in the fire-front region, which is directly relevant to emergency evacuation planning. This will supply concrete evidence that the divergence has operational implications beyond what global metrics capture. revision: yes
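
The paired comparison proposed in the second response can be sketched as follows. This is a minimal illustration under stated assumptions: each model is assumed to yield one FCER-restricted ECE value per training run, runs are paired by index, and the helper name `paired_t_statistic` and the per-run numbers are hypothetical placeholders rather than reported results.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Paired t statistic for matched samples: mean(d) / (sd(d) / sqrt(n))."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0.0:
        raise ValueError("all paired differences are identical")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


# Placeholder per-run ECE values for five runs (hypothetical, for illustration).
ensemble_ece = [0.10, 0.12, 0.11, 0.13, 0.09]
student_ece = [0.11, 0.12, 0.13, 0.13, 0.10]
t_stat = paired_t_statistic(ensemble_ece, student_ece)
degrees_of_freedom = len(ensemble_ece) - 1
```

With five runs there are four degrees of freedom, so the two-sided 5% critical value is about 2.776. Obtaining a p-value requires the t distribution, which `scipy.stats.ttest_rel` provides. With so few runs a rank-based alternative such as the Wilcoxon signed-rank test is another option.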

Circularity Check

0 steps flagged

No circularity: empirical comparison on external dataset with no derivations or self-referential reductions.

Full rationale

The manuscript introduces the FCER protocol as a new spatially conditioned evaluation method and reports an empirical head-to-head comparison of an ensemble versus a distilled student model on the WildfireSpreadTS dataset. No equations, derivations, fitted parameters, or uniqueness theorems appear in the provided text. The central claim rests on observed calibration and uncertainty-ranking behavior under FCER rather than any reduction of outputs to inputs by construction. Self-citation is absent from the load-bearing steps, and the protocol definition does not tautologically presuppose its own superiority.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the validity of the newly introduced FCER protocol and standard deep-learning assumptions for ensemble and distillation-based UQ.

axioms (1)
  • domain assumption: Ensemble and knowledge-distillation methods produce well-calibrated uncertainty estimates for spatiotemporal wildfire data.
    Implicit in the comparison of ensemble and student model calibration.
invented entities (1)
  • Fire-Centered Evaluation Region (FCER) (no independent evidence)
    purpose: Spatially conditioned protocol to characterize UQ within critical fire zones.
    Newly defined evaluation framework with no independent prior validation cited.


