Does Your Wildfire Prediction Model Actually Work, or Just Score Well?

Liling Chang; Qi Wang; Yangshuang Xu; Yushun Dong; Yuyang Dai

arxiv: 2605.18911 · v2 · pith:GMSZIP2Mnew · submitted 2026-05-14 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-25 05:59 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords wildfire prediction · foundation models · evaluation framework · transfer learning · Earth observation · model benchmarking · sparse events

The pith

Wildfire model transfer conclusions depend strongly on evaluation design and task formulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WILDFIRE-FM as the first foundation model pretrained specifically on wildfire data including weather, active fires, topography, vegetation, and environmental variables. It shows that sparse wildfire events make standard model comparisons unreliable because results shift with how positive and negative examples are matched and which prediction heads are chosen. To address this, the authors create a fixed-contract evaluation framework that uses a fixed-output check to control matching-rule effects and a fixed-feature check to control head-selection effects. When ten Earth foundation model baselines are compared to WILDFIRE-FM under matched contracts across occupancy, spread, retrieval, and regression tasks, the relative performance and transfer claims change with the exact evaluation choices. This demonstrates that reliable wildfire forecasting requires evaluation methods that isolate these design factors rather than relying on uncontrolled benchmarks.

Core claim

Under a fixed-contract evaluation framework with fixed-output and fixed-feature checks, comparisons of WILDFIRE-FM against ten Earth-FM baselines across occupancy, spread, retrieval, and regression tasks show that wildfire transfer conclusions depend strongly on evaluation design and task formulation.

What carries the argument

Fixed-contract evaluation framework using a fixed-output check to isolate matching-rule effects and a fixed-feature check to isolate head-selection effects.
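A minimal sketch of the idea behind the fixed-output check, assuming a toy grid: the predicted and observed fire cells are held fixed and only the matching rule changes. The function names, the Chebyshev-radius tolerance, and the cell coordinates are illustrative assumptions; the paper's own matching rules (exact, tolerated, union) are defined in its Section 3.3 and may differ.

```python
def dilate(cells, radius):
    """Return every grid cell within Chebyshev distance `radius` of any input cell."""
    out = set()
    for (r, c) in cells:
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                out.add((r + dr, c + dc))
    return out


def f1_under_rule(pred, label, radius):
    """F1 where a positive counts as matched if the other field has a positive
    within `radius` cells. radius=0 is exact cell-wise matching."""
    if not pred or not label:
        return 0.0
    matched_pred = len(pred & dilate(label, radius))
    matched_label = len(label & dilate(pred, radius))
    precision = matched_pred / len(pred)
    recall = matched_label / len(label)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical fixed outputs: nothing below changes between the two rules.
label = {(2, 2), (5, 5)}
model_a = {(2, 2), (9, 9)}  # one exact hit, one far miss
model_b = {(2, 3), (5, 6)}  # two near misses, each one cell off

for radius in (0, 1):
    print(radius,
          f1_under_rule(model_a, label, radius),
          f1_under_rule(model_b, label, radius))
```

In this toy case model A leads under exact matching and model B leads once one cell of tolerance is allowed, even though neither model's output changed.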

If this is right

Model rankings can reverse when matching rules or task formulations are changed.
Domain-specific pretraining effects become measurable only after evaluation contracts are explicitly matched.
Standard single-score benchmarks for wildfire prediction are insufficient to support transfer claims.
The framework applies uniformly to occupancy, spread, retrieval, and regression tasks.
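The rank-reversal implication can be made concrete with a small rank-shift table in the before→after style that Figure 4 describes. This is a sketch only: the backbone names and scores are made-up placeholders, not results from the paper, and ties are broken alphabetically as an arbitrary assumption.

```python
def rank_models(scores):
    """Map model name -> rank (1 = best) within one contract; ties broken by name."""
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return {m: i + 1 for i, m in enumerate(ordered)}


def rank_shift(before, after):
    """Map model name -> (rank under `before`, rank under `after`)."""
    rb, ra = rank_models(before), rank_models(after)
    return {m: (rb[m], ra[m]) for m in before}


# Hypothetical F1 scores for the same three backbones under two matching rules.
exact = {"backbone_a": 0.31, "backbone_b": 0.27, "backbone_c": 0.22}
tolerant = {"backbone_a": 0.40, "backbone_b": 0.48, "backbone_c": 0.35}

shift = rank_shift(exact, tolerant)
for model, (b, a) in shift.items():
    print(f"{model}: {b}→{a}")
```

Reporting the full shift table, rather than a single ranking, is one way to expose how much a conclusion depends on the chosen contract.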

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same evaluation sensitivities could appear in other prediction problems with rare events such as floods or disease outbreaks.
Future studies might report performance across a range of contracts rather than a single design.
The checks could be tested for robustness when applied to non-foundation-model architectures.

Load-bearing premise

The fixed-output check and fixed-feature check isolate matching-rule and head-selection effects without introducing their own uncontrolled biases.

What would settle it

A follow-up experiment that reapplies the fixed-contract checks to the same models on new wildfire data and finds that model rankings remain stable across different matching rules and task formulations.

Figures

Figures reproduced from arXiv: 2605.18911 by Liling Chang, Qi Wang, Yangshuang Xu, Yushun Dong, Yuyang Dai.

**Figure 3.** Evaluation contract map for the six fixed…

**Figure 4.** Primary-task rank changes (RQ1). Cells show rank before→after. Green/red/gray mark moving up/down/no change; darker green or red marks a larger move. Following Section 3.3, Ex/Tol/Un are occupancy exact, tolerated, and union matching; Sp is spread spatial-overlap F1. Because both tasks involve spatially sparse targets (fire-active cells for occupancy, burned raster patches for spread), the operational ass…

**Figure 5.** Head-selection regret under fixed features (RQ2). Each point is one backbone; selection regret δ follows Section 3.4 under global-scope union-F1. To answer RQ2, we conduct a fixed-feature check on occupancy and fire spread tasks, holding the frozen feature source, T, Ω, Λ, and candidate head family H ⊆ A fixed while varying only the selection metric between PR-AUC-based and decision-F1-based selectio…

**Figure 6.** Rank map for supporting task comparison (RQ4). Each row fixes one task contract C and ranks the eligible backbones within that contract. The figure shows rank changes across task forms; native metric values are reported in…

**Figure 7.** Matching-rule sensitivity in fire-prone occupancy (RQ1). Each row holds the score field S, label field Y, threshold, and Ω fixed, and changes only Λ. Legend: strict F1, added F1 from spatial tolerance, added F1 from union matching; a red outline marks WILDFIRE-FM, and a dashed line separates the original weather FMs from the added baselines. The horizontal axis is F1 in percent.
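The head-selection regret mentioned in the Figure 5 caption can be illustrated with a small sketch. The definition used here is an assumption, since the paper's exact δ is given in its Section 3.4: regret is taken as the decision-metric gap between the best available head and the head picked by a different selection metric, with features held fixed. Head names and metric values are hypothetical.

```python
def selection_regret(heads, select_by, judge_by):
    """Pick a head by `select_by`, then measure how much `judge_by` score is
    lost relative to the best head under `judge_by`.

    Returns (chosen_head, best_head, regret)."""
    chosen = max(heads, key=lambda h: heads[h][select_by])
    best = max(heads, key=lambda h: heads[h][judge_by])
    return chosen, best, heads[best][judge_by] - heads[chosen][judge_by]


# Hypothetical validation metrics for three heads on the same frozen features.
heads = {
    "linear": {"pr_auc": 0.30, "f1": 0.25},
    "mlp": {"pr_auc": 0.28, "f1": 0.40},
    "conv": {"pr_auc": 0.22, "f1": 0.35},
}

chosen, best, delta = selection_regret(heads, "pr_auc", "f1")
print(chosen, best, round(delta, 2))
```

When the selection metric and the judging metric coincide, the regret is zero by construction, which is why the check varies only the selection metric.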

Original abstract

Wildfire prediction is important for early warning and resource allocation, yet existing Earth foundation models (Earth FMs) are pretrained for general atmospheric and geophysical objectives rather than wildfire forecasting. To address this gap, we introduce WILDFIRE-FM, the first foundation model pretrained specifically for wildfire prediction using weather, active-fire observations, topography, vegetation, and static environmental data. However, introducing a domain-specific backbone alone does not solve the evaluation problem: wildfire events are sparse in space and time, making transfer conclusions highly sensitive to matching rules and evaluation settings. To address this problem, we introduce a fixed-contract evaluation framework with two controlled checks: a fixed-output check for matching-rule effects and a fixed-feature check for head-selection effects. Under matched contracts, we compare WILDFIRE-FM with ten Earth-FM baselines across occupancy, spread, retrieval, and regression tasks. Our results show that wildfire transfer conclusions depend strongly on evaluation design and task formulation. We hope this framework and WILDFIRE-FM provide a foundation for future wildfire-specific Earth-FM research and benchmarking. Our code is available at https://anonymous.4open.science/r/Wildfire-fm-evaluation-contracts-5AE9/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces a wildfire-specific FM and an eval framework showing design sensitivity, but the fixed checks likely introduce their own distribution shifts in sparse data.

The main takeaway is that wildfire transfer results are sensitive to how you set up matching and task formulation, and the authors try to make that concrete with WILDFIRE-FM plus a fixed-contract framework using fixed-output and fixed-feature checks. They pretrain on weather, fire observations, topography and vegetation, then compare against ten Earth FMs on occupancy, spread, retrieval and regression tasks. The code link is useful if someone wants to inspect the setup. That part is straightforward and addresses a real practical gap in sparse-event modeling. The framework idea itself is new in this domain and could help standardize comparisons.

The soft spot is exactly the one in the stress-test note. Fixing outputs or features in a setting where events are rare in space and time will change which cases survive the matching step and how targets are normalized. Nothing in the abstract shows that the marginal distributions are preserved or that the sensitivity result holds under different isolation choices, so the central claim about evaluation dependence could be partly an artifact of the controls. Without those checks or quantitative distribution comparisons, the evidence for clean separation of matching-rule versus head-selection effects is weak.

This is for people doing ML on environmental disasters or rare-event forecasting who care about benchmarking. A reader working on evaluation contracts for imbalanced data might borrow the fixed-contract idea, but the specific checks need tighter validation before they become standard. It should go to peer review because the problem it flags is real and the model plus code give something concrete to work with, even if the framework requires more scrutiny on bias.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces WILDFIRE-FM, the first foundation model pretrained specifically on weather, active-fire, topography, vegetation and static data for wildfire prediction tasks. It also proposes a fixed-contract evaluation framework containing a fixed-output check (to isolate matching-rule effects) and a fixed-feature check (to isolate head-selection effects). Using this framework the authors compare WILDFIRE-FM against ten existing Earth foundation models on occupancy, spread, retrieval and regression tasks and conclude that wildfire transfer conclusions depend strongly on evaluation design and task formulation.

Significance. If the controlled checks are shown to be free of distributional artifacts, the work would usefully highlight the fragility of transfer claims in sparse-event domains and supply both a domain-specific backbone and a reusable evaluation contract for future wildfire Earth-FM research. Code release is a clear reproducibility strength.

major comments (2)

[Abstract / Evaluation Framework] Abstract / Evaluation Framework description: the central claim that transfer conclusions 'depend strongly on evaluation design' rests on the assertion that the fixed-output and fixed-feature checks cleanly separate matching-rule from head-selection effects. No quantitative verification is supplied that these controls preserve the original marginal distributions of rare wildfire events or that alternative isolation strategies yield consistent sensitivity rankings; without such evidence the observed dependence could be partly an artifact of the controls themselves.
[Abstract] Abstract: the manuscript states that 'under matched contracts' WILDFIRE-FM is compared with ten baselines across four task families, yet supplies no numerical results, confidence intervals, or ablation tables showing the magnitude of the reported sensitivity. This absence prevents assessment of whether the dependence is practically large enough to alter model rankings or merely statistically detectable.
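One way to attach the uncertainty the second major comment asks for is a paired bootstrap over evaluation tiles. The sketch below is an assumption about how that could be done, not the authors' procedure: the per-tile (tp, fp, fn) counts are hypothetical, and a percentile interval is only one of several possible choices.

```python
import random


def f1(tp, fp, fn):
    """F1 from pooled counts; 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_gap(tiles_a, tiles_b, n_boot=2000, seed=0):
    """Percentile 95% interval for F1(A) - F1(B).

    tiles_x[i] = (tp, fp, fn) for model x on tile i. Both lists cover the
    same tiles, and each resample draws the same tile indices for both models."""
    rng = random.Random(seed)
    n = len(tiles_a)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        sums_a = [sum(tiles_a[i][k] for i in idx) for k in range(3)]
        sums_b = [sum(tiles_b[i][k] for i in idx) for k in range(3)]
        gaps.append(f1(*sums_a) - f1(*sums_b))
    gaps.sort()
    return gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot) - 1]


# Hypothetical per-tile counts; most tiles contain no fire at all.
tiles_a = [(3, 1, 2), (0, 0, 0), (1, 2, 0), (0, 1, 0), (4, 0, 3), (0, 0, 1)]
tiles_b = [(2, 0, 3), (0, 1, 0), (1, 1, 0), (0, 0, 0), (5, 2, 2), (0, 0, 1)]

low, high = bootstrap_f1_gap(tiles_a, tiles_b)
print(low, high)
```

If the interval straddles zero under one matching rule but not another, a rank change between the two models is within resampling noise for the first rule.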

minor comments (2)

[Abstract] The anonymous code link is helpful for reproducibility but should be replaced by a permanent repository (e.g., Zenodo or GitHub) before publication.
[Title] The title poses a binary question ('Actually Work, or Just Score Well?') that the abstract does not directly answer with a decisive verdict; a more descriptive title would better align with the methodological contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on our evaluation framework. We respond to each major comment below.

Referee: [Abstract / Evaluation Framework] Abstract / Evaluation Framework description: the central claim that transfer conclusions 'depend strongly on evaluation design' rests on the assertion that the fixed-output and fixed-feature checks cleanly separate matching-rule from head-selection effects. No quantitative verification is supplied that these controls preserve the original marginal distributions of rare wildfire events or that alternative isolation strategies yield consistent sensitivity rankings; without such evidence the observed dependence could be partly an artifact of the controls themselves.

Authors: The fixed-output check holds the score field, label field, threshold, and Ω fixed while varying only the matching rule Λ; the fixed-feature check holds the frozen feature source and candidate head family fixed while varying only the head-selection metric. This design isolates the targeted effects by construction. We acknowledge that explicit verification (e.g., statistical tests confirming marginal distributions of rare events remain unchanged) was not reported in the initial submission. In revision we will add an appendix with such verification (Kolmogorov-Smirnov statistics on event frequency and spatial distribution) and a brief discussion of why alternative isolation methods were not needed for the sensitivity ranking. revision: yes
Referee: [Abstract] Abstract: the manuscript states that 'under matched contracts' WILDFIRE-FM is compared with ten baselines across four task families, yet supplies no numerical results, confidence intervals, or ablation tables showing the magnitude of the reported sensitivity. This absence prevents assessment of whether the dependence is practically large enough to alter model rankings or merely statistically detectable.

Authors: The abstract is intentionally concise and states the high-level conclusion. All quantitative results—including per-task performance tables, confidence intervals, ranking changes across the four task families, and ablation tables that quantify how evaluation design alters model orderings—are provided in Sections 4–5 and the associated figures/tables of the full manuscript. We therefore disagree that the abstract must contain the numbers; however, if the editor requests, we can add one or two representative effect-size examples to the abstract in revision. revision: no
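The Kolmogorov-Smirnov verification promised in the first response could look roughly like the following sketch. It computes the two-sample KS statistic by hand so the example has no dependencies; `scipy.stats.ks_2samp` is the usual library route and also returns a p-value. The per-tile fire counts are hypothetical and stand in for event frequency before and after a control is applied.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of sample A that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of sample B that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical active-fire counts per tile, before and after a control is applied.
original = [0, 0, 1, 2, 3]
after_control = [0, 1, 1, 4, 6]

print(ks_statistic(original, original))
print(ks_statistic(original, after_control))
```

A statistic near zero would support the claim that the control leaves the event-frequency distribution unchanged; with counts this sparse and heavily tied, the statistic is best read alongside the raw histograms.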

Circularity Check

0 steps flagged

No circularity: evaluation framework and model introduced as independent methodological contributions.

The manuscript introduces WILDFIRE-FM and a fixed-contract evaluation framework (fixed-output check for matching-rule effects; fixed-feature check for head-selection effects) without any derivations, equations, or fitted parameters that reduce to prior inputs. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are described. The central claim that transfer conclusions depend on evaluation design rests on comparisons performed under the newly defined framework, which does not collapse to its own inputs by construction. This is the most common honest finding for a purely methodological paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claims rest on the unstated assumption that the proposed checks control confounding factors in sparse-event evaluation.

[1] [1]

Pangu- weather: A 3d high-resolution model for fast and accurate global weather forecast.Nature, 619(7970):533–538, 2023

Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. Pangu- weather: A 3d high-resolution model for fast and accurate global weather forecast.Nature, 619(7970):533–538, 2023

work page 2023

[2] [2]

Aurora: A foundation model of the atmosphere.arXiv preprint arXiv:2405.13063, 2024

Cristian Bodnar et al. Aurora: A foundation model of the atmosphere.arXiv preprint arXiv:2405.13063, 2024

work page arXiv 2024

[3] [3]

AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data

Christopher F Brown, Michal R Kazmierski, Valerie J Pasquarella, William J Rucklidge, Masha Samsikova, Chenhui Zhang, Evan Shelhamer, Estefania Lahera, Olivia Wiles, Simon Ilyushchenko, et al. Alphaearth foundations: An embedding field model for accurate and efficient global mapping from sparse label data.arXiv preprint arXiv:2507.22291, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Fengwu: Pushing the skillful global medium-range weather forecast beyond 10 days lead.arXiv preprint arXiv:2304.02948, 2023

Kang Chen, Tao Han, Junchao Gong, Lei Bai, Fenghua Ling, Jing-Jia Luo, Xi Chen, Leiming Ma, Tianning Zhang, Rui Su, et al. Fengwu: Pushing the skillful global medium-range weather forecast beyond 10 days lead.arXiv preprint arXiv:2304.02948, 2023

work page arXiv 2023

[5] [5]

Fuxi: A cascade machine learning forecasting system for 15-day global weather forecast.npj climate and atmospheric science, 6(1):190, 2023

Lei Chen, Xiaohui Zhong, Feng Zhang, Yuan Cheng, Yinghui Xu, Yuan Qi, and Hao Li. Fuxi: A cascade machine learning forecasting system for 15-day global weather forecast.npj climate and atmospheric science, 6(1):190, 2023

work page 2023

[6] [6]

Elizabeth E. Ebert. Neighborhood verification: A strategy for rewarding close forecasts.Weather and Forecasting, 24(6):1498–1510, 2009

work page 2009

[7] [7]

Introducing spatially distributed fire danger from earth observations (fdeo) using satellite-based data in the contiguous united states.Remote Sensing, 12(8):1252, 2020

Alireza Farahmand, E Natasha Stavros, John T Reager, and Ali Behrangi. Introducing spatially distributed fire danger from earth observations (fdeo) using satellite-based data in the contiguous united states.Remote Sensing, 12(8):1252, 2020

work page 2020

[8]

Sebastian Gerard, Yu Zhao, and Josephine Sullivan. WildfireSpreadTS: A dataset of multi-modal time series for wildfire spread prediction. Advances in Neural Information Processing Systems, 36:74515–74529, 2023

[9]

Eric Gilleland, David Ahijevych, Barbara G Brown, and Elizabeth E Ebert. Intercomparison of spatial forecast verification methods. Weather and Forecasting, 24(5):1416–1430, 2009

[10]

Johann Georg Goldammer. Early warning systems for the prediction of and appropriate response to wildfires and related environmental hazards. In Early Warning Systems for Natural Disaster Reduction, 1999

[11]

Fantine Huot, R Lily Hu, Nita Goyal, Tharun Sankar, Matthias Ihme, and Yi-Fan Chen. Next day wildfire prediction using deep learning. arXiv preprint arXiv:2206.08930, 2022

[12]

Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. WILDS: A benchmark of in-the-wild distribution shifts. In Proceedings of the International Conference on Machine Learning, pages 5637–5664, 2021

[13]

Vassiliki Kotroni, Constantinos Cartalis, Silas Michaelides, Julia Stoyanova, Filippos Tymvios, Antonis Bezes, Theodoros Christoudias, Stavros Dafis, Christos Giannakopoulos, Theodore M. Giannaros, et al. DISARM early warning system for wildfires in the Eastern Mediterranean. Sustainability, 12(16):6670, 2020

[14]

Alexandre Lacoste, Nils Lehmann, Pau Rodriguez, Evan Sherwin, Hannah Kerner, Björn Lütjens, Jeremy Irvin, David Dao, Hamed Alemohammad, Alexandre Drouin, et al. GEO-Bench: Toward foundation models for earth monitoring. In Advances in Neural Information Processing Systems, 2023

[15]

Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, et al. Learning skillful medium-range global weather forecasting. Science, 382(6677):1416–1421, 2023

[16]

LANDFIRE. LANDFIRE 40 Fire Behavior Fuel Models. https://landfire.gov/fuel/fbfm40, 2026. Accessed: 2026-05-05

[17]

LANDFIRE. LANDFIRE Forest Canopy Cover. https://landfire.gov/fuel/cc, 2026. Accessed: 2026-05-05

[18]

Carsten T. Lüth, Till J. Bungert, Lukas Klein, and Paul F. Jäger. Navigating the pitfalls of active learning evaluation: A systematic framework for meaningful performance assessment. In Advances in Neural Information Processing Systems, 2023

[19]

Valerio Marsocci, Yuru Jia, Georges Le Bellier, David Kerekes, Liang Zeng, Sebastian Hafner, Sebastian Gerard, Eric Brune, Ritu Yadav, Ali Shibli, et al. Pangaea: A global and inclusive benchmark for geospatial foundation models. arXiv preprint arXiv:2412.04204, 2024

[20]

Matthew B. McDermott, Haoran Zhang, Lasse Hyldig Hansen, Giovanni Angelotti, and Jack Gallifant. A closer look at AUROC and AUPRC under class imbalance. In Advances in Neural Information Processing Systems, 2024

[21]

NASA Earthdata. Fire Information for Resource Management System (FIRMS). https://www.earthdata.nasa.gov/data/tools/firms, 2026. Accessed: 2026-05-05

[22]

National Interagency Fire Center. Wildland Fire Interagency Geospatial Services (WFIGS): Current Perimeters. https://data-nifc.opendata.arcgis.com/datasets/nifc::wfigs-current-perimeters/about, 2026. Accessed: 2026-05-05

[23]

Tung Nguyen, Johannes Brandstetter, Aditya Kapoor, Jayesh K Gupta, and Aditya Grover. ClimaX: A foundation model for weather and climate. arXiv preprint arXiv:2301.10343, 2023

[24]

NOAA National Centers for Environmental Information. Rapid Refresh / High-Resolution Rapid Refresh. https://www.ncei.noaa.gov/products/weather-climate-models/rapid-refresh-update, 2026. Accessed: 2026-05-05

[25]

NOAA National Centers for Environmental Prediction Environmental Modeling Center. High-Resolution Rapid Refresh (HRRR). https://rapidrefresh.noaa.gov/hrrr/, 2026. Accessed: 2026-05-05

[26]

Oak Ridge National Laboratory. LandScan Global 2024. https://landscan.ornl.gov/,

[27] [28]

Jaideep Pathak, Yair Cohen, Piyush Garg, Peter Harrington, Noah Brenowitz, Dale Durran, Morteza Mardani, Arash Vahdat, Shaoming Xu, Karthik Kashinath, et al. Kilometer-scale convection-allowing model emulation using generative diffusion modeling. Science Advances, 12(5):eadv0423, 2026

[28] [29]

Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, et al. FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv preprint arXiv:2202.11214, 2022

[29] [30]

Paul D Pickell, Nicholas C Coops, Colin J Ferster, Christopher W Bater, Karen D Blouin, Mike D Flannigan, and Jinkai Zhang. An early warning system to forecast the close of the spring burning window from satellite-observed greenness. Scientific Reports, 7(1):14190, 2017

[30] [31]

David Radke, Anna Hessler, and David Ellsworth. FireCast: Leveraging deep learning to predict wildfire spread. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 4575–4581, 2019

[31] [32]

Stephan Rasp, Stephan Hoyer, Alex Merose, Ian Langmore, Peter Battaglia, Tyler Russell, Alvaro Sanchez-Gonzalez, Vivian Yang, Rob Carver, Shreya Agrawal, et al. WeatherBench 2: A benchmark for the next generation of data-driven global weather models. Journal of Advances in Modeling Earth Systems, 16(6):e2023MS004019, 2024

[32] [33]

Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brockman, Christopher Funk, Brian Clipp, Kurt Keutzer, Stefano Ermon, and Ruslan Salakhutdinov. Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. arXiv preprint arXiv:2212.14532, 2023

[33] [34]

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention, pages 234–241, 2015

[34] [35]

Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? In Advances in Neural Information Processing Systems, 2023

[35] [36]

Johannes Schmude, Sujit Roy, Paulina Trofimova, Karthik Ramesh, Bethany Lusch, Harikumar Kesa, Shraddha Singh, Phil Chen, Zhuohan Liu, Shubhankar Parashar, et al. Prithvi WxC: Foundation model for weather and climate. arXiv preprint arXiv:2409.13598, 2024

[36] [37]

Adam J. Stewart, Caleb Robinson, Isaac A. Corley, Anthony Ortiz, Juan M. Lavista Ferres, and Arindam Banerjee. TorchGeo: Deep learning with geospatial data. In Proceedings of the ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2022

[37] [38]

Jeremias Traub, Till J. Bungert, Carsten T. Lüth, Michael Baumgartner, Klaus H. Maier-Hein, Lena Maier-Hein, and Paul F. Jäger. Overcoming common flaws in the evaluation of selective classification systems. In Advances in Neural Information Processing Systems, 2024

[38] [39]

U.S. Geological Survey and USDA Forest Service. Monitoring Trends in Burn Severity (MTBS). https://www.mtbs.gov/, 2025. Accessed: 2026-05-05

[39] [40]

USDA Forest Service. Wildfire Risk to Communities: Housing Unit Density Image Service. https://catalog.data.gov/dataset/wildfire-risk-to-communities-housing-unit-density-image-service-fac22,

[40] [41]

Accessed: 2026-05-05


[41] [42]

Jonathan A Weyn, Dale R Durran, and Rich Caruana. Improving data-driven global weather prediction using deep convolutional neural networks on a cubed sphere. Journal of Advances in Modeling Earth Systems, 12(9):e2020MS002109, 2020

[42] [43]

Christopher Yeh, Chenlin Meng, Sijing Wang, Anne Driscoll, Erik Rozi, Peng Liu, Jae Yong Lee, Marshall Burke, David B. Lobell, and Stefano Ermon. SustainBench: Benchmarks for monitoring the sustainable development goals with machine learning. In Advances in Neural Information Processing Systems, 2021

Appendix Contents A Evaluation Contract Specificatio...