Does Your Wildfire Prediction Model Actually Work, or Just Score Well?

Liling Chang; Qi Wang; Yangshuang Xu; Yushun Dong; Yuyang Dai

arxiv: 2605.18911 · v1 · pith:GMSZIP2Mnew · submitted 2026-05-14 · 💻 cs.LG · cs.AI

Does Your Wildfire Prediction Model Actually Work, or Just Score Well?

Yangshuang Xu , Yuyang Dai , Liling Chang , Qi Wang , Yushun Dong This is my paper

Pith reviewed 2026-05-20 20:40 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords wildfire predictionfoundation modelsevaluation frameworktransfer learningsparse eventsEarth observationmodel benchmarking

0 comments

The pith

Wildfire model performance rankings depend on evaluation rules

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WILDFIRE-FM, a foundation model pretrained specifically for wildfire prediction using weather, fire observations, and environmental data. It argues that because wildfire events are rare, standard evaluations can mislead about how well models transfer to new settings. To fix this, the authors propose a fixed-contract evaluation framework that includes a fixed-output check to control for matching rules and a fixed-feature check to control for head selection. Results from comparing the new model against ten baselines across multiple tasks show that conclusions about which model works best change with the evaluation choices. A sympathetic reader would care because this means many reported successes in wildfire forecasting might not hold up under stricter scrutiny.

Core claim

Under a fixed-contract evaluation framework, comparisons between WILDFIRE-FM and ten Earth foundation model baselines across occupancy, spread, retrieval, and regression tasks demonstrate that wildfire transfer conclusions depend strongly on evaluation design and task formulation.

What carries the argument

The fixed-contract evaluation framework, which uses a fixed-output check to isolate matching-rule effects and a fixed-feature check to isolate head-selection effects, ensuring controlled comparisons in sparse wildfire data.

If this is right

Wildfire foundation model performance must be assessed under matched contracts to yield reliable transfer conclusions.
Task formulation influences whether a model appears effective for prediction or retrieval.
The new WILDFIRE-FM provides a domain-specific backbone that can be evaluated fairly using this framework.
Future benchmarking in wildfire research should adopt similar controlled checks to avoid confounding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar evaluation sensitivities likely exist in other domains with sparse rare events, such as earthquake prediction or disease outbreak forecasting.
Without such frameworks, claims about foundation model superiority in Earth science applications may often be artifacts of evaluation design rather than true improvements.
Researchers could test the framework on historical wildfire datasets to verify if past model comparisons hold under fixed contracts.

Load-bearing premise

The fixed-output check and fixed-feature check together control for the main confounding factors that affect transfer conclusions in sparse-event settings.

What would settle it

If model rankings and transfer conclusions remain the same across different matching rules and feature selections when using the fixed-contract framework, this would show that evaluation design does not strongly affect conclusions.

Figures

Figures reproduced from arXiv: 2605.18911 by Liling Chang, Qi Wang, Yangshuang Xu, Yushun Dong, Yuyang Dai.

**Figure 3.** Figure 3: Evaluation contract map for the six fixed [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Primary-task rank changes (RQ1). Cells show rank before→after. Green/red/gray mark moving up/down/no change; darker green or red marks a larger move. Following Section 3.3, Ex/Tol/Un are occupancy exact, tolerated, and union matching; Sp is spread spatial-overlap F1. Because both tasks involve spatially sparse targets, fire-active cells for occupancy, burned raster patches for spread, the operational ass… view at source ↗

**Figure 5.** Figure 5: Head-selection regret under fixed features (RQ2). Each point is one backbone; selection regret δ follows Section 3.4 under globalscope union-F1. To answer RQ2, we conduct a fixed-feature check on occupancy and fire spread tasks, holding the frozen feature source, T , Ω, Λ, and candidate head family H ⊆ A fixed while varying only the selection metric between PR-AUCbased and decision-F1-based selectio… view at source ↗

**Figure 6.** Figure 6: Rank map for supporting task comparison (RQ4). Each row fixes one task contract C and ranks the eligible backbones within that contract. The figure shows rank changes across task forms; native metric values are reported in [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Matching-rule sensitivity in fire-prone occupancy (RQ1). Each row holds the score field S, label field Y , threshold, and Ω fixed, and changes only Λ. Legend: ■ strict F1, ■ added F1 from spatial tolerance, ■ added F1 from union matching, red outline WILDFIRE-FM, and dashed line original weather FMs vs. added baselines. The horizontal axis is F1 in percent. B Controlled Check Details B.1 Fixed-Output Check… view at source ↗

read the original abstract

Wildfire prediction is important for early warning and resource allocation, yet existing Earth foundation models (Earth FMs) are pretrained for general atmospheric and geophysical objectives rather than wildfire forecasting. To address this gap, we introduce WILDFIRE-FM, the first foundation model pretrained specifically for wildfire prediction using weather, active-fire observations, topography, vegetation, and static environmental data. However, introducing a domain-specific backbone alone does not solve the evaluation problem: wildfire events are sparse in space and time, making transfer conclusions highly sensitive to matching rules and evaluation settings. To address this problem, we introduce a fixed-contract evaluation framework with two controlled checks: a fixed-output check for matching-rule effects and a fixed-feature check for head-selection effects. Under matched contracts, we compare WILDFIRE-FM with ten Earth-FM baselines across occupancy, spread, retrieval, and regression tasks. Our results show that wildfire transfer conclusions depend strongly on evaluation design and task formulation. We hope this framework and WILDFIRE-FM provide a foundation for future wildfire-specific Earth-FM research and benchmarking. Our code is available at https://anonymous.4open.science/r/Wildfire-fm-evaluation-contracts-5AE9/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows wildfire model rankings shift with evaluation rules and offers a domain-specific FM plus fixed-contract checks to make comparisons more controlled.

read the letter

The main thing to know is that they built WILDFIRE-FM, pretrained directly on wildfire-relevant inputs like active fires, weather, topography and vegetation, and paired it with a fixed-contract evaluation that uses a fixed-output check for matching rules plus a fixed-feature check for head selection. This setup is meant to reduce the sensitivity of transfer results when events are sparse. They run comparisons against ten Earth-FM baselines on occupancy, spread, retrieval and regression tasks and conclude that rankings depend heavily on those design choices. That is the concrete contribution. The framework idea is practical for anyone who has run into the same problem with rare-event data, and releasing the code helps others test it. The pretraining choice itself is a straightforward but useful specialization rather than another general Earth model. The central claim holds up in principle because the abstract makes clear they are isolating matching and feature effects rather than claiming a universal win. The soft spot is the lack of reported numbers, error bars, dataset sizes or ablation tables in the summary, which makes it hard to judge the size of the effect or whether the two checks actually cover the main confounders. If the full experiments show only modest shifts or leave other factors like label noise unaddressed, the practical impact shrinks. This is for researchers working on foundation models for environmental forecasting or on reliable evaluation for imbalanced spatial tasks. A reader who needs to compare models on wildfire or similar sparse phenomena would get direct value from the framework. It is solid enough on its own terms to deserve peer review, mainly so referees can check the implementation details and the strength of the quantitative evidence.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces WILDFIRE-FM, the first foundation model pretrained specifically for wildfire prediction on weather, active-fire observations, topography, vegetation, and static environmental data. It argues that transfer conclusions for such sparse-event tasks are highly sensitive to evaluation design and task formulation, and proposes a fixed-contract framework consisting of a fixed-output check (for matching-rule effects) and a fixed-feature check (for head-selection effects). The paper then compares WILDFIRE-FM against ten Earth-FM baselines across occupancy, spread, retrieval, and regression tasks, concluding that performance rankings and transfer claims vary substantially depending on the chosen contract.

Significance. If the empirical comparisons hold under the proposed controls, the work usefully demonstrates the fragility of standard transfer evaluations in rare-event domains and supplies both a domain-specific backbone and an evaluation protocol that future wildfire and Earth-science ML studies can adopt. The public release of code is a clear strength that supports reproducibility.

major comments (2)

[§4.2] §4.2 (Fixed-contract framework): the assertion that the fixed-output and fixed-feature checks together 'control for the main confounding factors' is load-bearing for the central claim, yet the manuscript provides no explicit ablation or sensitivity analysis showing that other factors (e.g., temporal split strategies or class imbalance handling) are neutralized; without this, the dependence result risks being partly attributable to uncontrolled variables.
[Results section, Table 3] Results section, Table 3 (or equivalent cross-task summary): the reported performance gaps between WILDFIRE-FM and the Earth-FM baselines under different contracts are presented without error bars, statistical significance tests, or multiple random seeds; this weakens the claim that conclusions 'depend strongly' on evaluation design, as the magnitude and reliability of the sensitivity cannot be assessed.

minor comments (3)

[Abstract] Abstract: the headline result is stated qualitatively; adding one concrete numerical example (e.g., 'occupancy AUC drops from 0.82 to 0.61 when switching from fixed-output to standard matching') would make the sensitivity claim more immediate for readers.
[§3] §3 (Pretraining details): the loss function and masking strategy used for WILDFIRE-FM are described at a high level; specifying the exact objective (e.g., cross-entropy weights for active-fire pixels) would aid replication.
[Figure 2] Figure 2 (evaluation contract diagram): the arrows and boxes are difficult to follow at print size; adding a small legend or numbered steps would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the value of the fixed-contract framework and the public code release. We address each major comment below and outline the revisions we will make to strengthen the empirical support for our claims.

read point-by-point responses

Referee: [§4.2] §4.2 (Fixed-contract framework): the assertion that the fixed-output and fixed-feature checks together 'control for the main confounding factors' is load-bearing for the central claim, yet the manuscript provides no explicit ablation or sensitivity analysis showing that other factors (e.g., temporal split strategies or class imbalance handling) are neutralized; without this, the dependence result risks being partly attributable to uncontrolled variables.

Authors: We agree that isolating the effects of the evaluation contract requires demonstrating that other design choices do not drive the observed sensitivity. The fixed-output check standardizes the prediction format and matching rules, while the fixed-feature check standardizes the input representation and head architecture; these two controls were chosen because they directly address the most common sources of non-comparable transfer results in sparse-event settings. Nevertheless, the original manuscript does not contain explicit sensitivity analyses for temporal split strategies or class-imbalance handling. In the revised version we will add a dedicated subsection that re-runs the cross-contract comparisons under alternative temporal splits (e.g., year-based vs. random) and under different imbalance-correction regimes, confirming that the ranking reversals across contracts persist. This addition will make the claim that conclusions depend strongly on the contract more robust. revision: yes
Referee: [Results section, Table 3] Results section, Table 3 (or equivalent cross-task summary): the reported performance gaps between WILDFIRE-FM and the Earth-FM baselines under different contracts are presented without error bars, statistical significance tests, or multiple random seeds; this weakens the claim that conclusions 'depend strongly' on evaluation design, as the magnitude and reliability of the sensitivity cannot be assessed.

Authors: We accept that the absence of variability estimates and formal statistical tests limits the strength of the sensitivity claim. In the revised manuscript we will re-execute all experiments with at least three independent random seeds, report means and standard deviations as error bars in Table 3 and the associated figures, and include paired statistical significance tests (with appropriate multiple-comparison correction) between contracts. These additions will allow readers to assess both the magnitude and the reliability of the performance differences that arise when the evaluation contract is changed. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper is an empirical comparison study that introduces WILDFIRE-FM and a fixed-contract evaluation framework consisting of fixed-output and fixed-feature checks, then reports results across occupancy, spread, retrieval, and regression tasks. All load-bearing claims rest on direct model comparisons under matched evaluation contracts rather than any derivation, equation, or prediction that reduces to its own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify core results, and the framework is presented as a methodological contribution whose effects are measured externally against baselines. The work is therefore self-contained against its own empirical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claims rest on standard machine-learning assumptions about pretraining and transfer plus the new model and evaluation procedure; no free parameters or invented physical entities are described in the abstract.

axioms (1)

domain assumption Standard machine-learning assumptions about pretraining objectives and transfer learning apply to wildfire data.
The work relies on typical foundation-model practices without stating unique mathematical axioms.

invented entities (1)

WILDFIRE-FM no independent evidence
purpose: Domain-specific foundation model for wildfire prediction
New model introduced by the paper; no independent evidence outside this work is mentioned.

pith-pipeline@v0.9.0 · 5747 in / 1273 out tokens · 101763 ms · 2026-05-20T20:40:59.687853+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 2 internal anchors

[1]

Pangu- weather: A 3d high-resolution model for fast and accurate global weather forecast.Nature, 619(7970):533–538, 2023

Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. Pangu- weather: A 3d high-resolution model for fast and accurate global weather forecast.Nature, 619(7970):533–538, 2023

work page 2023
[2]

Bruinsma, Ana Lucic, Megan Stanley, Anna Vaughan, Johannes Brandstetter, Patrick Riechert, Jonathan A

Cristian Bodnar et al. Aurora: A foundation model of the atmosphere.arXiv preprint arXiv:2405.13063, 2024

work page arXiv 2024
[3]

AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data

Christopher F Brown, Michal R Kazmierski, Valerie J Pasquarella, William J Rucklidge, Masha Samsikova, Chenhui Zhang, Evan Shelhamer, Estefania Lahera, Olivia Wiles, Simon Ilyushchenko, et al. Alphaearth foundations: An embedding field model for accurate and efficient global mapping from sparse label data.arXiv preprint arXiv:2507.22291, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Fengwu: Pushing the skillful global medium-range weather forecast beyond 10 days lead.arXiv preprint arXiv:2304.02948, 2023

Kang Chen, Tao Han, Junchao Gong, Lei Bai, Fenghua Ling, Jing-Jia Luo, Xi Chen, Leiming Ma, Tianning Zhang, Rui Su, et al. Fengwu: Pushing the skillful global medium-range weather forecast beyond 10 days lead.arXiv preprint arXiv:2304.02948, 2023

work page arXiv 2023
[5]

Fuxi: A cascade machine learning forecasting system for 15-day global weather forecast.npj climate and atmospheric science, 6(1):190, 2023

Lei Chen, Xiaohui Zhong, Feng Zhang, Yuan Cheng, Yinghui Xu, Yuan Qi, and Hao Li. Fuxi: A cascade machine learning forecasting system for 15-day global weather forecast.npj climate and atmospheric science, 6(1):190, 2023

work page 2023
[6]

Elizabeth E. Ebert. Neighborhood verification: A strategy for rewarding close forecasts.Weather and Forecasting, 24(6):1498–1510, 2009

work page 2009
[7]

Introducing spatially distributed fire danger from earth observations (fdeo) using satellite-based data in the contiguous united states.Remote Sensing, 12(8):1252, 2020

Alireza Farahmand, E Natasha Stavros, John T Reager, and Ali Behrangi. Introducing spatially distributed fire danger from earth observations (fdeo) using satellite-based data in the contiguous united states.Remote Sensing, 12(8):1252, 2020

work page 2020
[8]

Wildfirespreadts: A dataset of multi-modal time series for wildfire spread prediction.Advances in Neural Information Processing Systems, 36:74515–74529, 2023

Sebastian Gerard, Yu Zhao, and Josephine Sullivan. Wildfirespreadts: A dataset of multi-modal time series for wildfire spread prediction.Advances in Neural Information Processing Systems, 36:74515–74529, 2023

work page 2023
[9]

Intercomparison of spatial forecast verification methods.Weather and Forecasting, 24(5):1416–1430, 2009

Eric Gilleland, David Ahijevych, Barbara G Brown, and Elizabeth E Ebert. Intercomparison of spatial forecast verification methods.Weather and Forecasting, 24(5):1416–1430, 2009

work page 2009
[10]

Early warning systems for the prediction of and appropriate response to wildfires and related environmental hazards

Johann Georg Goldammer. Early warning systems for the prediction of and appropriate response to wildfires and related environmental hazards. InEarly Warning Systems for Natural Disaster Reduction, 1999

work page 1999
[11]

Next day wildfire prediction using deep learning.arXiv preprint arXiv:2206.08930, 2022

Fantine Huot, R Lily Hu, Nita Goyal, Tharun Sankar, Matthias Ihme, and Yi-Fan Chen. Next day wildfire prediction using deep learning.arXiv preprint arXiv:2206.08930, 2022

work page arXiv 2022
[12]

WILDS: A benchmark of in-the-wild distribution shifts

Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. WILDS: A benchmark of in-the-wild distribution shifts. InProceedings of the International Conference on Machine Learning, pages 5637–5664, 2021

work page 2021
[13]

Giannaros, et al

Vassiliki Kotroni, Constantinos Cartalis, Silas Michaelides, Julia Stoyanova, Filippos Tymvios, Antonis Bezes, Theodoros Christoudias, Stavros Dafis, Christos Giannakopoulos, Theodore M. Giannaros, et al. Disarm early warning system for wildfires in the eastern mediterranean. Sustainability, 12(16):6670, 2020. 10

work page 2020
[14]

GEO- Bench: Toward foundation models for earth monitoring

Alexandre Lacoste, Nils Lehmann, Pau Rodriguez, Evan Sherwin, Hannah Kerner, Björn Lütjens, Jeremy Irvin, David Dao, Hamed Alemohammad, Alexandre Drouin, et al. GEO- Bench: Toward foundation models for earth monitoring. InAdvances in Neural Information Processing Systems, 2023

work page 2023
[15]

Learning skillful medium-range global weather forecasting.Science, 382(6677):1416–1421, 2023

Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, et al. Learning skillful medium-range global weather forecasting.Science, 382(6677):1416–1421, 2023

work page 2023
[16]

LANDFIRE 40 Fire Behavior Fuel Models

LANDFIRE. LANDFIRE 40 Fire Behavior Fuel Models. https://landfire.gov/fuel/ fbfm40, 2026. Accessed: 2026-05-05

work page 2026
[17]

LANDFIRE Forest Canopy Cover

LANDFIRE. LANDFIRE Forest Canopy Cover. https://landfire.gov/fuel/cc, 2026. Accessed: 2026-05-05

work page 2026
[18]

Lüth, Till J

Carsten T. Lüth, Till J. Bungert, Lukas Klein, and Paul F. Jäger. Navigating the pitfalls of active learning evaluation: A systematic framework for meaningful performance assessment. In Advances in Neural Information Processing Systems, 2023

work page 2023
[19]

arXiv preprint arXiv:2412.04204 , year =

Valerio Marsocci, Yuru Jia, Georges Le Bellier, David Kerekes, Liang Zeng, Sebastian Hafner, Sebastian Gerard, Eric Brune, Ritu Yadav, Ali Shibli, et al. Pangaea: A global and inclusive benchmark for geospatial foundation models.arXiv preprint arXiv:2412.04204, 2024

work page arXiv 2024
[20]

McDermott, Haoran Zhang, Lasse Hyldig Hansen, Giovanni Angelotti, and Jack Gallifant

Matthew B. McDermott, Haoran Zhang, Lasse Hyldig Hansen, Giovanni Angelotti, and Jack Gallifant. A closer look at AUROC and AUPRC under class imbalance. InAdvances in Neural Information Processing Systems, 2024

work page 2024
[21]

Fire Information for Resource Management System (FIRMS)

NASA Earthdata. Fire Information for Resource Management System (FIRMS). https: //www.earthdata.nasa.gov/data/tools/firms, 2026. Accessed: 2026-05-05

work page 2026
[22]

Wildland Fire Interagency Geospatial Services (WFIGS): Current Perimeters

National Interagency Fire Center. Wildland Fire Interagency Geospatial Services (WFIGS): Current Perimeters. https://data-nifc.opendata.arcgis.com/datasets/nifc:: wfigs-current-perimeters/about, 2026. Accessed: 2026-05-05

work page 2026
[23]

K., and Grover, A

Tung Nguyen, Johannes Brandstetter, Aditya Kapoor, Jayesh K Gupta, and Aditya Grover. Climax: A foundation model for weather and climate.arXiv preprint arXiv:2301.10343, 2023

work page arXiv 2023
[24]

Rapid Refresh / High-Resolution Rapid Refresh

NOAA National Centers for Environmental Information. Rapid Refresh / High-Resolution Rapid Refresh. https://www.ncei.noaa.gov/products/weather-climate-models/ rapid-refresh-update, 2026. Accessed: 2026-05-05

work page 2026
[25]

High- Resolution Rapid Refresh (HRRR)

NOAA National Centers for Environmental Prediction Environmental Modeling Center. High- Resolution Rapid Refresh (HRRR). https://rapidrefresh.noaa.gov/hrrr/, 2026. Ac- cessed: 2026-05-05

work page 2026
[26]

LandScan Global 2024

Oak Ridge National Laboratory. LandScan Global 2024. https://landscan.ornl.gov/,

work page 2024
[28]

Kilometer-scale convection-allowing model emulation using generative diffusion modeling.Science Advances, 12(5):eadv0423, 2026

Jaideep Pathak, Yair Cohen, Piyush Garg, Peter Harrington, Noah Brenowitz, Dale Durran, Morteza Mardani, Arash Vahdat, Shaoming Xu, Karthik Kashinath, et al. Kilometer-scale convection-allowing model emulation using generative diffusion modeling.Science Advances, 12(5):eadv0423, 2026

work page 2026
[29]

FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators

Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, et al. Fourcastnet: A global data-driven high-resolution weather model using adaptive fourier neural operators.arXiv preprint arXiv:2202.11214, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[30]

An early warning system to forecast the close of the spring burning window from satellite-observed greenness.Scientific Reports, 7(1):14190, 2017

Paul D Pickell, Nicholas C Coops, Colin J Ferster, Christopher W Bater, Karen D Blouin, Mike D Flannigan, and Jinkai Zhang. An early warning system to forecast the close of the spring burning window from satellite-observed greenness.Scientific Reports, 7(1):14190, 2017

work page 2017
[31]

Firecast: Leveraging deep learning to predict wildfire spread

David Radke, Anna Hessler, and David Ellsworth. Firecast: Leveraging deep learning to predict wildfire spread. InProceedings of the 28th International Joint Conference on Artificial Intelligence, pages 4575–4581, 2019. 11

work page 2019
[32]

WeatherBench 2: A benchmark for the next generation of data-driven global weather models.Journal of Advances in Modeling Earth Systems, 16(6):e2023MS004019, 2024

Stephan Rasp, Stephan Hoyer, Alex Merose, Ian Langmore, Peter Battaglia, Tyler Russell, Alvaro Sanchez-Gonzalez, Vivian Yang, Rob Carver, Shreya Agrawal, et al. WeatherBench 2: A benchmark for the next generation of data-driven global weather models.Journal of Advances in Modeling Earth Systems, 16(6):e2023MS004019, 2024

work page 2024
[33]

Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning.arXiv preprint arXiv:2212.14532, 2023

Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brockman, Christopher Funk, Brian Clipp, Kurt Keutzer, Stefano Ermon, and Ruslan Salakhutdinov. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning.arXiv preprint arXiv:2212.14532, 2023

work page arXiv 2023
[34]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InMedical Image Computing and Computer-Assisted Inter- vention, pages 234–241, 2015

work page 2015
[35]

Are emergent abilities of large language models a mirage? InAdvances in Neural Information Processing Systems, 2023

Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? InAdvances in Neural Information Processing Systems, 2023

work page 2023
[36]

Prithvi wxc: Foun- dation model for weather and climate,

Johannes Schmude, Sujit Roy, Paulina Trofimova, Karthik Ramesh, Bethany Lusch, Harikumar Kesa, Shraddha Singh, Phil Chen, Zhuohan Liu, Shubhankar Parashar, et al. Prithvi wxc: Foundation model for weather and climate.arXiv preprint arXiv:2409.13598, 2024

work page arXiv 2024
[37]

Stewart, Caleb Robinson, Isaac A

Adam J. Stewart, Caleb Robinson, Isaac A. Corley, Anthony Ortiz, Juan M. Lavista Ferres, and Arindam Banerjee. Torchgeo: Deep learning with geospatial data. InProceedings of the ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2022

work page 2022
[38]

Bungert, Carsten T

Jeremias Traub, Till J. Bungert, Carsten T. Lüth, Michael Baumgartner, Klaus H. Maier-Hein, Lena Maier-Hein, and Paul F. Jäger. Overcoming common flaws in the evaluation of selective classification systems. InAdvances in Neural Information Processing Systems, 2024

work page 2024
[39]

Geological Survey and USDA Forest Service

U.S. Geological Survey and USDA Forest Service. Monitoring Trends in Burn Severity (MTBS). https://www.mtbs.gov/, 2025. Accessed: 2026-05-05

work page 2025
[40]

Wildfire Risk to Communities: Housing Unit Density Image Service

USDA Forest Service. Wildfire Risk to Communities: Housing Unit Density Image Service. https://catalog.data.gov/dataset/ wildfire-risk-to-communities-housing-unit-density-image-service-fac22 ,

work page
[41]

Accessed: 2026-05-05

work page 2026
[42]

Improving data-driven global weather prediction using deep convolutional neural networks on a cubed sphere.Journal of Advances in Modeling Earth Systems, 12(9):e2020MS002109, 2020

Jonathan A Weyn, Dale R Durran, and Rich Caruana. Improving data-driven global weather prediction using deep convolutional neural networks on a cubed sphere.Journal of Advances in Modeling Earth Systems, 12(9):e2020MS002109, 2020

work page 2020
[43]

Lobell, and Stefano Ermon

Christopher Yeh, Chenlin Meng, Sijing Wang, Anne Driscoll, Erik Rozi, Peng Liu, Jae Yong Lee, Marshall Burke, David B. Lobell, and Stefano Ermon. SustainBench: Benchmarks for monitoring the sustainable development goals with machine learning. InAdvances in Neural Information Processing Systems, 2021. 12 Appendix Contents A Evaluation Contract Specificatio...

work page arXiv 2021

[1] [1]

Pangu- weather: A 3d high-resolution model for fast and accurate global weather forecast.Nature, 619(7970):533–538, 2023

Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. Pangu- weather: A 3d high-resolution model for fast and accurate global weather forecast.Nature, 619(7970):533–538, 2023

work page 2023

[2] [2]

Bruinsma, Ana Lucic, Megan Stanley, Anna Vaughan, Johannes Brandstetter, Patrick Riechert, Jonathan A

Cristian Bodnar et al. Aurora: A foundation model of the atmosphere.arXiv preprint arXiv:2405.13063, 2024

work page arXiv 2024

[3] [3]

AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data

Christopher F Brown, Michal R Kazmierski, Valerie J Pasquarella, William J Rucklidge, Masha Samsikova, Chenhui Zhang, Evan Shelhamer, Estefania Lahera, Olivia Wiles, Simon Ilyushchenko, et al. Alphaearth foundations: An embedding field model for accurate and efficient global mapping from sparse label data.arXiv preprint arXiv:2507.22291, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Fengwu: Pushing the skillful global medium-range weather forecast beyond 10 days lead.arXiv preprint arXiv:2304.02948, 2023

Kang Chen, Tao Han, Junchao Gong, Lei Bai, Fenghua Ling, Jing-Jia Luo, Xi Chen, Leiming Ma, Tianning Zhang, Rui Su, et al. Fengwu: Pushing the skillful global medium-range weather forecast beyond 10 days lead.arXiv preprint arXiv:2304.02948, 2023

work page arXiv 2023

[5] [5]

Fuxi: A cascade machine learning forecasting system for 15-day global weather forecast.npj climate and atmospheric science, 6(1):190, 2023

Lei Chen, Xiaohui Zhong, Feng Zhang, Yuan Cheng, Yinghui Xu, Yuan Qi, and Hao Li. Fuxi: A cascade machine learning forecasting system for 15-day global weather forecast.npj climate and atmospheric science, 6(1):190, 2023

work page 2023

[6] [6]

Elizabeth E. Ebert. Neighborhood verification: A strategy for rewarding close forecasts.Weather and Forecasting, 24(6):1498–1510, 2009

work page 2009

[7] [7]

Introducing spatially distributed fire danger from earth observations (fdeo) using satellite-based data in the contiguous united states.Remote Sensing, 12(8):1252, 2020

Alireza Farahmand, E Natasha Stavros, John T Reager, and Ali Behrangi. Introducing spatially distributed fire danger from earth observations (fdeo) using satellite-based data in the contiguous united states.Remote Sensing, 12(8):1252, 2020

work page 2020

[8] [8]

Wildfirespreadts: A dataset of multi-modal time series for wildfire spread prediction.Advances in Neural Information Processing Systems, 36:74515–74529, 2023

Sebastian Gerard, Yu Zhao, and Josephine Sullivan. Wildfirespreadts: A dataset of multi-modal time series for wildfire spread prediction.Advances in Neural Information Processing Systems, 36:74515–74529, 2023

work page 2023

[9] [9]

Intercomparison of spatial forecast verification methods.Weather and Forecasting, 24(5):1416–1430, 2009

Eric Gilleland, David Ahijevych, Barbara G Brown, and Elizabeth E Ebert. Intercomparison of spatial forecast verification methods.Weather and Forecasting, 24(5):1416–1430, 2009

work page 2009

[10] [10]

Early warning systems for the prediction of and appropriate response to wildfires and related environmental hazards

Johann Georg Goldammer. Early warning systems for the prediction of and appropriate response to wildfires and related environmental hazards. InEarly Warning Systems for Natural Disaster Reduction, 1999

work page 1999

[11] [11]

Next day wildfire prediction using deep learning.arXiv preprint arXiv:2206.08930, 2022

Fantine Huot, R Lily Hu, Nita Goyal, Tharun Sankar, Matthias Ihme, and Yi-Fan Chen. Next day wildfire prediction using deep learning.arXiv preprint arXiv:2206.08930, 2022

work page arXiv 2022

[12] [12]

WILDS: A benchmark of in-the-wild distribution shifts

Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. WILDS: A benchmark of in-the-wild distribution shifts. InProceedings of the International Conference on Machine Learning, pages 5637–5664, 2021

work page 2021

[13] [13]

Giannaros, et al

Vassiliki Kotroni, Constantinos Cartalis, Silas Michaelides, Julia Stoyanova, Filippos Tymvios, Antonis Bezes, Theodoros Christoudias, Stavros Dafis, Christos Giannakopoulos, Theodore M. Giannaros, et al. Disarm early warning system for wildfires in the eastern mediterranean. Sustainability, 12(16):6670, 2020. 10

work page 2020

[14] [14]

GEO- Bench: Toward foundation models for earth monitoring

Alexandre Lacoste, Nils Lehmann, Pau Rodriguez, Evan Sherwin, Hannah Kerner, Björn Lütjens, Jeremy Irvin, David Dao, Hamed Alemohammad, Alexandre Drouin, et al. GEO- Bench: Toward foundation models for earth monitoring. InAdvances in Neural Information Processing Systems, 2023

work page 2023

[15] [15]

Learning skillful medium-range global weather forecasting.Science, 382(6677):1416–1421, 2023

Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, et al. Learning skillful medium-range global weather forecasting.Science, 382(6677):1416–1421, 2023

work page 2023

[16] [16]

LANDFIRE 40 Fire Behavior Fuel Models

LANDFIRE. LANDFIRE 40 Fire Behavior Fuel Models. https://landfire.gov/fuel/ fbfm40, 2026. Accessed: 2026-05-05

work page 2026

[17] [17]

LANDFIRE Forest Canopy Cover

LANDFIRE. LANDFIRE Forest Canopy Cover. https://landfire.gov/fuel/cc, 2026. Accessed: 2026-05-05

work page 2026

[18] [18]

Lüth, Till J

Carsten T. Lüth, Till J. Bungert, Lukas Klein, and Paul F. Jäger. Navigating the pitfalls of active learning evaluation: A systematic framework for meaningful performance assessment. In Advances in Neural Information Processing Systems, 2023

work page 2023

[19] [19]

arXiv preprint arXiv:2412.04204 , year =

Valerio Marsocci, Yuru Jia, Georges Le Bellier, David Kerekes, Liang Zeng, Sebastian Hafner, Sebastian Gerard, Eric Brune, Ritu Yadav, Ali Shibli, et al. Pangaea: A global and inclusive benchmark for geospatial foundation models.arXiv preprint arXiv:2412.04204, 2024

work page arXiv 2024

[20] [20]

McDermott, Haoran Zhang, Lasse Hyldig Hansen, Giovanni Angelotti, and Jack Gallifant

Matthew B. McDermott, Haoran Zhang, Lasse Hyldig Hansen, Giovanni Angelotti, and Jack Gallifant. A closer look at AUROC and AUPRC under class imbalance. InAdvances in Neural Information Processing Systems, 2024

work page 2024

[21] [21]

Fire Information for Resource Management System (FIRMS)

NASA Earthdata. Fire Information for Resource Management System (FIRMS). https: //www.earthdata.nasa.gov/data/tools/firms, 2026. Accessed: 2026-05-05

work page 2026

[22] [22]

Wildland Fire Interagency Geospatial Services (WFIGS): Current Perimeters

National Interagency Fire Center. Wildland Fire Interagency Geospatial Services (WFIGS): Current Perimeters. https://data-nifc.opendata.arcgis.com/datasets/nifc:: wfigs-current-perimeters/about, 2026. Accessed: 2026-05-05

work page 2026

[23] [23]

K., and Grover, A

Tung Nguyen, Johannes Brandstetter, Aditya Kapoor, Jayesh K Gupta, and Aditya Grover. Climax: A foundation model for weather and climate.arXiv preprint arXiv:2301.10343, 2023

work page arXiv 2023

[24] [24]

Rapid Refresh / High-Resolution Rapid Refresh

NOAA National Centers for Environmental Information. Rapid Refresh / High-Resolution Rapid Refresh. https://www.ncei.noaa.gov/products/weather-climate-models/ rapid-refresh-update, 2026. Accessed: 2026-05-05

work page 2026

[25] [25]

High- Resolution Rapid Refresh (HRRR)

NOAA National Centers for Environmental Prediction Environmental Modeling Center. High- Resolution Rapid Refresh (HRRR). https://rapidrefresh.noaa.gov/hrrr/, 2026. Ac- cessed: 2026-05-05

work page 2026

[26] [26]

LandScan Global 2024

Oak Ridge National Laboratory. LandScan Global 2024. https://landscan.ornl.gov/,

work page 2024

[27] [28]

Kilometer-scale convection-allowing model emulation using generative diffusion modeling.Science Advances, 12(5):eadv0423, 2026

Jaideep Pathak, Yair Cohen, Piyush Garg, Peter Harrington, Noah Brenowitz, Dale Durran, Morteza Mardani, Arash Vahdat, Shaoming Xu, Karthik Kashinath, et al. Kilometer-scale convection-allowing model emulation using generative diffusion modeling.Science Advances, 12(5):eadv0423, 2026

work page 2026

[28] [29]

FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators

Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, et al. Fourcastnet: A global data-driven high-resolution weather model using adaptive fourier neural operators.arXiv preprint arXiv:2202.11214, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[29] [30]

An early warning system to forecast the close of the spring burning window from satellite-observed greenness.Scientific Reports, 7(1):14190, 2017

Paul D Pickell, Nicholas C Coops, Colin J Ferster, Christopher W Bater, Karen D Blouin, Mike D Flannigan, and Jinkai Zhang. An early warning system to forecast the close of the spring burning window from satellite-observed greenness.Scientific Reports, 7(1):14190, 2017

work page 2017

[30] [31]

Firecast: Leveraging deep learning to predict wildfire spread

David Radke, Anna Hessler, and David Ellsworth. Firecast: Leveraging deep learning to predict wildfire spread. InProceedings of the 28th International Joint Conference on Artificial Intelligence, pages 4575–4581, 2019. 11

work page 2019

[31] [32]

WeatherBench 2: A benchmark for the next generation of data-driven global weather models.Journal of Advances in Modeling Earth Systems, 16(6):e2023MS004019, 2024

Stephan Rasp, Stephan Hoyer, Alex Merose, Ian Langmore, Peter Battaglia, Tyler Russell, Alvaro Sanchez-Gonzalez, Vivian Yang, Rob Carver, Shreya Agrawal, et al. WeatherBench 2: A benchmark for the next generation of data-driven global weather models.Journal of Advances in Modeling Earth Systems, 16(6):e2023MS004019, 2024

work page 2024

[32] [33]

Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning.arXiv preprint arXiv:2212.14532, 2023

Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brockman, Christopher Funk, Brian Clipp, Kurt Keutzer, Stefano Ermon, and Ruslan Salakhutdinov. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning.arXiv preprint arXiv:2212.14532, 2023

work page arXiv 2023

[33] [34]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InMedical Image Computing and Computer-Assisted Inter- vention, pages 234–241, 2015

work page 2015

[34] [35]

Are emergent abilities of large language models a mirage? InAdvances in Neural Information Processing Systems, 2023

Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? InAdvances in Neural Information Processing Systems, 2023

work page 2023

[35] [36]

Prithvi wxc: Foun- dation model for weather and climate,

Johannes Schmude, Sujit Roy, Paulina Trofimova, Karthik Ramesh, Bethany Lusch, Harikumar Kesa, Shraddha Singh, Phil Chen, Zhuohan Liu, Shubhankar Parashar, et al. Prithvi wxc: Foundation model for weather and climate.arXiv preprint arXiv:2409.13598, 2024

work page arXiv 2024

[36] [37]

Stewart, Caleb Robinson, Isaac A

Adam J. Stewart, Caleb Robinson, Isaac A. Corley, Anthony Ortiz, Juan M. Lavista Ferres, and Arindam Banerjee. Torchgeo: Deep learning with geospatial data. InProceedings of the ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2022

work page 2022

[37] [38]

Bungert, Carsten T

Jeremias Traub, Till J. Bungert, Carsten T. Lüth, Michael Baumgartner, Klaus H. Maier-Hein, Lena Maier-Hein, and Paul F. Jäger. Overcoming common flaws in the evaluation of selective classification systems. InAdvances in Neural Information Processing Systems, 2024

work page 2024

[38] [39]

Geological Survey and USDA Forest Service

U.S. Geological Survey and USDA Forest Service. Monitoring Trends in Burn Severity (MTBS). https://www.mtbs.gov/, 2025. Accessed: 2026-05-05

work page 2025

[39] [40]

Wildfire Risk to Communities: Housing Unit Density Image Service

USDA Forest Service. Wildfire Risk to Communities: Housing Unit Density Image Service. https://catalog.data.gov/dataset/ wildfire-risk-to-communities-housing-unit-density-image-service-fac22 ,

work page

[40] [41]

Accessed: 2026-05-05

work page 2026

[41] [42]

Improving data-driven global weather prediction using deep convolutional neural networks on a cubed sphere.Journal of Advances in Modeling Earth Systems, 12(9):e2020MS002109, 2020

Jonathan A Weyn, Dale R Durran, and Rich Caruana. Improving data-driven global weather prediction using deep convolutional neural networks on a cubed sphere.Journal of Advances in Modeling Earth Systems, 12(9):e2020MS002109, 2020

work page 2020

[42] [43]

Lobell, and Stefano Ermon

Christopher Yeh, Chenlin Meng, Sijing Wang, Anne Driscoll, Erik Rozi, Peng Liu, Jae Yong Lee, Marshall Burke, David B. Lobell, and Stefano Ermon. SustainBench: Benchmarks for monitoring the sustainable development goals with machine learning. InAdvances in Neural Information Processing Systems, 2021. 12 Appendix Contents A Evaluation Contract Specificatio...

work page arXiv 2021