Does Your Wildfire Prediction Model Actually Work, or Just Score Well?
Pith reviewed 2026-05-25 05:59 UTC · model grok-4.3
The pith
Wildfire model transfer conclusions depend strongly on evaluation design and task formulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a fixed-contract evaluation framework with fixed-output and fixed-feature checks, comparisons of WILDFIRE-FM against ten Earth-FM baselines across occupancy, spread, retrieval, and regression tasks show that wildfire transfer conclusions depend strongly on evaluation design and task formulation.
What carries the argument
Fixed-contract evaluation framework using a fixed-output check to isolate matching-rule effects and a fixed-feature check to isolate head-selection effects.
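One way to picture the fixed-output check is to freeze a single set of model outputs and change only the rule that decides whether a prediction counts as matching an observed fire. The sketch below is illustrative and is not taken from the paper's code: `MatchingRule`, the Chebyshev-distance tolerance, `hit_rate`, and the toy grid cells are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

Cell = Tuple[int, int]  # (row, col) index of a grid cell


@dataclass(frozen=True)
class MatchingRule:
    """Hypothetical matching rule: an observed fire cell is matched if some
    prediction lies within `tolerance` cells (Chebyshev distance)."""
    name: str
    tolerance: int


def hit_rate(predicted: Sequence[Cell], observed: Sequence[Cell],
             rule: MatchingRule) -> float:
    """Fraction of observed fire cells matched by at least one prediction."""
    if not observed:
        return 0.0
    hits = 0
    for (r, c) in observed:
        if any(max(abs(r - pr), abs(c - pc)) <= rule.tolerance
               for (pr, pc) in predicted):
            hits += 1
    return hits / len(observed)


def fixed_output_check(predicted: Sequence[Cell], observed: Sequence[Cell],
                       rules: Sequence[MatchingRule]) -> Dict[str, float]:
    """Score one frozen set of predictions under several matching rules."""
    return {rule.name: hit_rate(predicted, observed, rule) for rule in rules}


# Toy data (made up): three observed fire cells, three predicted cells.
observed = [(2, 2), (5, 5), (8, 1)]
predicted = [(2, 3), (5, 5), (0, 0)]
rules = [MatchingRule("exact", 0), MatchingRule("neighborhood_1", 1)]

scores = fixed_output_check(predicted, observed, rules)
print(scores)
```

Because the predictions never change, any difference between the two scores in this toy setup is attributable to the matching rule alone, which is the kind of isolation the check is described as providing.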
If this is right
- Model rankings can reverse when matching rules or task formulations are changed.
- Domain-specific pretraining effects become measurable only after evaluation contracts are explicitly matched.
- Standard single-score benchmarks for wildfire prediction are insufficient to support transfer claims.
- The framework applies uniformly to occupancy, spread, retrieval, and regression tasks.
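The first implication, that rankings can reverse, can be checked mechanically once each model has a score under each contract. The following is a minimal sketch under stated assumptions: the model names and score values are invented for illustration and are not results from the paper, and `rank_models` and `reversed_pairs` are hypothetical helper names.

```python
from typing import Dict, List, Tuple


def rank_models(scores: Dict[str, float]) -> List[str]:
    """Return model names ordered best-first (higher score is better)."""
    return sorted(scores, key=lambda m: (-scores[m], m))


def reversed_pairs(scores_a: Dict[str, float],
                   scores_b: Dict[str, float]) -> List[Tuple[str, str]]:
    """Pairs of models whose relative order differs between two contracts."""
    models = sorted(scores_a)
    flips = []
    for i, m in enumerate(models):
        for n in models[i + 1:]:
            diff_a = scores_a[m] - scores_a[n]
            diff_b = scores_b[m] - scores_b[n]
            if diff_a * diff_b < 0:  # opposite signs: the order flipped
                flips.append((m, n))
    return flips


# Made-up scores for three placeholder models under two matching rules.
exact = {"model_a": 0.31, "model_b": 0.35, "model_c": 0.20}
tolerant = {"model_a": 0.62, "model_b": 0.55, "model_c": 0.40}

print(rank_models(exact))
print(rank_models(tolerant))
print(reversed_pairs(exact, tolerant))
```

Reporting the flipped pairs alongside the per-contract rankings makes it explicit which comparisons are stable and which depend on the contract.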
Where Pith is reading between the lines
- The same evaluation sensitivities could appear in other prediction problems with rare events such as floods or disease outbreaks.
- Future studies might report performance across a range of contracts rather than a single design.
- The checks could be tested for robustness when applied to non-foundation-model architectures.
Load-bearing premise
The fixed-output check and fixed-feature check isolate matching-rule and head-selection effects without introducing their own uncontrolled biases.
What would settle it
A follow-up experiment that reapplies the fixed-contract checks to the same models on new wildfire data and finds that model rankings remain stable across different matching rules and task formulations.
Original abstract
Wildfire prediction is important for early warning and resource allocation, yet existing Earth foundation models (Earth FMs) are pretrained for general atmospheric and geophysical objectives rather than wildfire forecasting. To address this gap, we introduce WILDFIRE-FM, the first foundation model pretrained specifically for wildfire prediction using weather, active-fire observations, topography, vegetation, and static environmental data. However, introducing a domain-specific backbone alone does not solve the evaluation problem: wildfire events are sparse in space and time, making transfer conclusions highly sensitive to matching rules and evaluation settings. To address this problem, we introduce a fixed-contract evaluation framework with two controlled checks: a fixed-output check for matching-rule effects and a fixed-feature check for head-selection effects. Under matched contracts, we compare WILDFIRE-FM with ten Earth-FM baselines across occupancy, spread, retrieval, and regression tasks. Our results show that wildfire transfer conclusions depend strongly on evaluation design and task formulation. We hope this framework and WILDFIRE-FM provide a foundation for future wildfire-specific Earth-FM research and benchmarking. Our code is available at https://anonymous.4open.science/r/Wildfire-fm-evaluation-contracts-5AE9/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces WILDFIRE-FM, the first foundation model pretrained specifically on weather, active-fire, topography, vegetation and static data for wildfire prediction tasks. It also proposes a fixed-contract evaluation framework containing a fixed-output check (to isolate matching-rule effects) and a fixed-feature check (to isolate head-selection effects). Using this framework the authors compare WILDFIRE-FM against ten existing Earth foundation models on occupancy, spread, retrieval and regression tasks and conclude that wildfire transfer conclusions depend strongly on evaluation design and task formulation.
Significance. If the controlled checks are shown to be free of distributional artifacts, the work would usefully highlight the fragility of transfer claims in sparse-event domains and supply both a domain-specific backbone and a reusable evaluation contract for future wildfire Earth-FM research. Code release is a clear reproducibility strength.
Major comments (2)
- [Abstract / Evaluation Framework] The central claim that transfer conclusions 'depend strongly on evaluation design' rests on the assertion that the fixed-output and fixed-feature checks cleanly separate matching-rule from head-selection effects. No quantitative verification is supplied that these controls preserve the original marginal distributions of rare wildfire events or that alternative isolation strategies yield consistent sensitivity rankings; without such evidence the observed dependence could be partly an artifact of the controls themselves.
- [Abstract] The manuscript states that 'under matched contracts' WILDFIRE-FM is compared with ten baselines across four task families, yet supplies no numerical results, confidence intervals, or ablation tables showing the magnitude of the reported sensitivity. This absence prevents assessment of whether the dependence is practically large enough to alter model rankings or merely statistically detectable.
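The distinction between a practically large and a merely detectable difference is usually made with an interval estimate. As a hedged illustration of what such a report could look like, the sketch below computes a paired percentile bootstrap interval for the mean score difference between two models over evaluation tiles. This is an assumed procedure, not the authors' method; `paired_bootstrap_ci`, the per-tile scores, and the resample count are hypothetical.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(scores_a: Sequence[float], scores_b: Sequence[float],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for mean(scores_a - scores_b).

    Scores are paired: index i in both sequences refers to the same tile.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample tiles with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Made-up per-tile scores for two placeholder models.
model_a = [0.40, 0.35, 0.50, 0.45, 0.30, 0.55]
model_b = [0.30, 0.30, 0.35, 0.40, 0.25, 0.40]

lo, hi = paired_bootstrap_ci(model_a, model_b)
print(f"95% interval for mean difference: [{lo:.3f}, {hi:.3f}]")
```

An interval that excludes zero under one contract but not under another would be direct evidence for the sensitivity the manuscript claims.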
Minor comments (2)
- [Abstract] The anonymous code link is helpful for reproducibility but should be replaced by a permanent repository (e.g., Zenodo or GitHub) before publication.
- [Title] The title poses a binary question ('Actually Work, or Just Score Well?') that the abstract does not directly answer with a decisive verdict; a more descriptive title would better align with the methodological contribution.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments on our evaluation framework. We respond to each major comment below.
- Referee: [Abstract / Evaluation Framework] The central claim that transfer conclusions 'depend strongly on evaluation design' rests on the assertion that the fixed-output and fixed-feature checks cleanly separate matching-rule from head-selection effects. No quantitative verification is supplied that these controls preserve the original marginal distributions of rare wildfire events or that alternative isolation strategies yield consistent sensitivity rankings; without such evidence the observed dependence could be partly an artifact of the controls themselves.
  Authors: The fixed-output check holds output format, matching rules, and data splits fixed and varies only the backbone; the fixed-feature check holds input features and backbone fixed and varies only the prediction head. This design isolates the targeted effects by construction. We acknowledge that explicit verification (e.g., statistical tests confirming that the marginal distributions of rare events remain unchanged) was not reported in the initial submission. In revision we will add an appendix with such verification (Kolmogorov-Smirnov statistics on event frequency and spatial distribution) and a brief discussion of why alternative isolation methods were not needed for the sensitivity ranking.
  Revision: yes
- Referee: [Abstract] The manuscript states that 'under matched contracts' WILDFIRE-FM is compared with ten baselines across four task families, yet supplies no numerical results, confidence intervals, or ablation tables showing the magnitude of the reported sensitivity. This absence prevents assessment of whether the dependence is practically large enough to alter model rankings or merely statistically detectable.
  Authors: The abstract is intentionally concise and states the high-level conclusion. All quantitative results (per-task performance tables, confidence intervals, ranking changes across the four task families, and ablation tables that quantify how evaluation design alters model orderings) are provided in Sections 4–5 and the associated figures and tables of the full manuscript. We therefore disagree that the abstract must contain the numbers; however, if the editor requests it, we can add one or two representative effect-size examples to the abstract in revision.
  Revision: no
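The Kolmogorov-Smirnov verification promised in the first response can be sketched in a few lines. The example below computes the two-sample KS statistic (the largest gap between two empirical CDFs) in plain Python; `ks_statistic` and the daily fire-count samples are hypothetical, and in practice a library routine such as `scipy.stats.ks_2samp` would also supply a p-value.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every value that appears in either sample.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Made-up daily active-fire counts before and after applying a control.
original = [0, 0, 0, 1, 0, 2, 0, 0]
controlled = [0, 0, 1, 0, 0, 2, 0, 0]
shifted = [3, 4, 5, 3, 4, 5, 3, 4]

print(ks_statistic(original, controlled))  # same values, different order
print(ks_statistic(original, shifted))     # no overlap between the samples
```

A statistic near zero would support the claim that a control leaves the event-frequency distribution unchanged. With counts this sparse and heavily tied at zero, the associated p-values should be read with caution.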
Circularity Check
No circularity: evaluation framework and model introduced as independent methodological contributions.
The manuscript introduces WILDFIRE-FM and a fixed-contract evaluation framework (fixed-output check for matching-rule effects; fixed-feature check for head-selection effects) without any derivations, equations, or fitted parameters that reduce to prior inputs. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are described. The central claim that transfer conclusions depend on evaluation design rests on comparisons performed under the newly defined framework, which does not collapse to its own inputs by construction. This is the most common honest finding for a purely methodological paper.