pith. sign in

arxiv: 2605.18911 · v1 · pith:GMSZIP2Mnew · submitted 2026-05-14 · 💻 cs.LG · cs.AI

Does Your Wildfire Prediction Model Actually Work, or Just Score Well?

Pith reviewed 2026-05-20 20:40 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords wildfire predictionfoundation modelsevaluation frameworktransfer learningsparse eventsEarth observationmodel benchmarking
0
0 comments X

The pith

Wildfire model performance rankings depend on evaluation rules

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WILDFIRE-FM, a foundation model pretrained specifically for wildfire prediction using weather, fire observations, and environmental data. It argues that because wildfire events are rare, standard evaluations can mislead about how well models transfer to new settings. To fix this, the authors propose a fixed-contract evaluation framework that includes a fixed-output check to control for matching rules and a fixed-feature check to control for head selection. Results from comparing the new model against ten baselines across multiple tasks show that conclusions about which model works best change with the evaluation choices. A sympathetic reader would care because this means many reported successes in wildfire forecasting might not hold up under stricter scrutiny.

Core claim

Under a fixed-contract evaluation framework, comparisons between WILDFIRE-FM and ten Earth foundation model baselines across occupancy, spread, retrieval, and regression tasks demonstrate that wildfire transfer conclusions depend strongly on evaluation design and task formulation.

What carries the argument

The fixed-contract evaluation framework, which uses a fixed-output check to isolate matching-rule effects and a fixed-feature check to isolate head-selection effects, ensuring controlled comparisons in sparse wildfire data.

If this is right

  • Wildfire foundation model performance must be assessed under matched contracts to yield reliable transfer conclusions.
  • Task formulation influences whether a model appears effective for prediction or retrieval.
  • The new WILDFIRE-FM provides a domain-specific backbone that can be evaluated fairly using this framework.
  • Future benchmarking in wildfire research should adopt similar controlled checks to avoid confounding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar evaluation sensitivities likely exist in other domains with sparse rare events, such as earthquake prediction or disease outbreak forecasting.
  • Without such frameworks, claims about foundation model superiority in Earth science applications may often be artifacts of evaluation design rather than true improvements.
  • Researchers could test the framework on historical wildfire datasets to verify if past model comparisons hold under fixed contracts.

Load-bearing premise

The fixed-output check and fixed-feature check together control for the main confounding factors that affect transfer conclusions in sparse-event settings.

What would settle it

If model rankings and transfer conclusions remain the same across different matching rules and feature selections when using the fixed-contract framework, this would show that evaluation design does not strongly affect conclusions.

Figures

Figures reproduced from arXiv: 2605.18911 by Liling Chang, Qi Wang, Yangshuang Xu, Yushun Dong, Yuyang Dai.

Figure 2
Figure 2. Figure 2: Matching rules for one fixed occupancy output. (a) Exact matching counts only same-time, [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Evaluation contract map for the six fixed [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Primary-task rank changes (RQ1). Cells show rank before→after. Green/red/gray mark moving up/down/no change; darker green or red marks a larger move. Following Section 3.3, Ex/Tol/Un are occupancy exact, tolerated, and union matching; Sp is spread spatial-overlap F1. Because both tasks involve spatially sparse tar￾gets, fire-active cells for occupancy, burned raster patches for spread, the operational as￾s… view at source ↗
Figure 5
Figure 5. Figure 5: Head-selection regret under fixed fea￾tures (RQ2). Each point is one backbone; se￾lection regret δ follows Section 3.4 under global￾scope union-F1. To answer RQ2, we conduct a fixed-feature check on occupancy and fire spread tasks, hold￾ing the frozen feature source, T , Ω, Λ, and can￾didate head family H ⊆ A fixed while vary￾ing only the selection metric between PR-AUC￾based and decision-F1-based selectio… view at source ↗
Figure 6
Figure 6. Figure 6: Rank map for supporting task comparison (RQ4). Each row fixes one task contract C and ranks the eligible backbones within that contract. The figure shows rank changes across task forms; native metric values are reported in [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Matching-rule sensitivity in fire-prone occupancy (RQ1). Each row holds the score field S, label field Y , threshold, and Ω fixed, and changes only Λ. Legend: ■ strict F1, ■ added F1 from spatial tolerance, ■ added F1 from union matching, red outline WILDFIRE-FM, and dashed line original weather FMs vs. added baselines. The horizontal axis is F1 in percent. B Controlled Check Details B.1 Fixed-Output Check… view at source ↗
read the original abstract

Wildfire prediction is important for early warning and resource allocation, yet existing Earth foundation models (Earth FMs) are pretrained for general atmospheric and geophysical objectives rather than wildfire forecasting. To address this gap, we introduce WILDFIRE-FM, the first foundation model pretrained specifically for wildfire prediction using weather, active-fire observations, topography, vegetation, and static environmental data. However, introducing a domain-specific backbone alone does not solve the evaluation problem: wildfire events are sparse in space and time, making transfer conclusions highly sensitive to matching rules and evaluation settings. To address this problem, we introduce a fixed-contract evaluation framework with two controlled checks: a fixed-output check for matching-rule effects and a fixed-feature check for head-selection effects. Under matched contracts, we compare WILDFIRE-FM with ten Earth-FM baselines across occupancy, spread, retrieval, and regression tasks. Our results show that wildfire transfer conclusions depend strongly on evaluation design and task formulation. We hope this framework and WILDFIRE-FM provide a foundation for future wildfire-specific Earth-FM research and benchmarking. Our code is available at https://anonymous.4open.science/r/Wildfire-fm-evaluation-contracts-5AE9/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces WILDFIRE-FM, the first foundation model pretrained specifically for wildfire prediction on weather, active-fire observations, topography, vegetation, and static environmental data. It argues that transfer conclusions for such sparse-event tasks are highly sensitive to evaluation design and task formulation, and proposes a fixed-contract framework consisting of a fixed-output check (for matching-rule effects) and a fixed-feature check (for head-selection effects). The paper then compares WILDFIRE-FM against ten Earth-FM baselines across occupancy, spread, retrieval, and regression tasks, concluding that performance rankings and transfer claims vary substantially depending on the chosen contract.

Significance. If the empirical comparisons hold under the proposed controls, the work usefully demonstrates the fragility of standard transfer evaluations in rare-event domains and supplies both a domain-specific backbone and an evaluation protocol that future wildfire and Earth-science ML studies can adopt. The public release of code is a clear strength that supports reproducibility.

major comments (2)
  1. [§4.2] §4.2 (Fixed-contract framework): the assertion that the fixed-output and fixed-feature checks together 'control for the main confounding factors' is load-bearing for the central claim, yet the manuscript provides no explicit ablation or sensitivity analysis showing that other factors (e.g., temporal split strategies or class imbalance handling) are neutralized; without this, the dependence result risks being partly attributable to uncontrolled variables.
  2. [Results section, Table 3] Results section, Table 3 (or equivalent cross-task summary): the reported performance gaps between WILDFIRE-FM and the Earth-FM baselines under different contracts are presented without error bars, statistical significance tests, or multiple random seeds; this weakens the claim that conclusions 'depend strongly' on evaluation design, as the magnitude and reliability of the sensitivity cannot be assessed.
minor comments (3)
  1. [Abstract] Abstract: the headline result is stated qualitatively; adding one concrete numerical example (e.g., 'occupancy AUC drops from 0.82 to 0.61 when switching from fixed-output to standard matching') would make the sensitivity claim more immediate for readers.
  2. [§3] §3 (Pretraining details): the loss function and masking strategy used for WILDFIRE-FM are described at a high level; specifying the exact objective (e.g., cross-entropy weights for active-fire pixels) would aid replication.
  3. [Figure 2] Figure 2 (evaluation contract diagram): the arrows and boxes are difficult to follow at print size; adding a small legend or numbered steps would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the value of the fixed-contract framework and the public code release. We address each major comment below and outline the revisions we will make to strengthen the empirical support for our claims.

read point-by-point responses
  1. Referee: [§4.2] §4.2 (Fixed-contract framework): the assertion that the fixed-output and fixed-feature checks together 'control for the main confounding factors' is load-bearing for the central claim, yet the manuscript provides no explicit ablation or sensitivity analysis showing that other factors (e.g., temporal split strategies or class imbalance handling) are neutralized; without this, the dependence result risks being partly attributable to uncontrolled variables.

    Authors: We agree that isolating the effects of the evaluation contract requires demonstrating that other design choices do not drive the observed sensitivity. The fixed-output check standardizes the prediction format and matching rules, while the fixed-feature check standardizes the input representation and head architecture; these two controls were chosen because they directly address the most common sources of non-comparable transfer results in sparse-event settings. Nevertheless, the original manuscript does not contain explicit sensitivity analyses for temporal split strategies or class-imbalance handling. In the revised version we will add a dedicated subsection that re-runs the cross-contract comparisons under alternative temporal splits (e.g., year-based vs. random) and under different imbalance-correction regimes, confirming that the ranking reversals across contracts persist. This addition will make the claim that conclusions depend strongly on the contract more robust. revision: yes

  2. Referee: [Results section, Table 3] Results section, Table 3 (or equivalent cross-task summary): the reported performance gaps between WILDFIRE-FM and the Earth-FM baselines under different contracts are presented without error bars, statistical significance tests, or multiple random seeds; this weakens the claim that conclusions 'depend strongly' on evaluation design, as the magnitude and reliability of the sensitivity cannot be assessed.

    Authors: We accept that the absence of variability estimates and formal statistical tests limits the strength of the sensitivity claim. In the revised manuscript we will re-execute all experiments with at least three independent random seeds, report means and standard deviations as error bars in Table 3 and the associated figures, and include paired statistical significance tests (with appropriate multiple-comparison correction) between contracts. These additions will allow readers to assess both the magnitude and the reliability of the performance differences that arise when the evaluation contract is changed. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper is an empirical comparison study that introduces WILDFIRE-FM and a fixed-contract evaluation framework consisting of fixed-output and fixed-feature checks, then reports results across occupancy, spread, retrieval, and regression tasks. All load-bearing claims rest on direct model comparisons under matched evaluation contracts rather than any derivation, equation, or prediction that reduces to its own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify core results, and the framework is presented as a methodological contribution whose effects are measured externally against baselines. The work is therefore self-contained against its own empirical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claims rest on standard machine-learning assumptions about pretraining and transfer plus the new model and evaluation procedure; no free parameters or invented physical entities are described in the abstract.

axioms (1)
  • domain assumption Standard machine-learning assumptions about pretraining objectives and transfer learning apply to wildfire data.
    The work relies on typical foundation-model practices without stating unique mathematical axioms.
invented entities (1)
  • WILDFIRE-FM no independent evidence
    purpose: Domain-specific foundation model for wildfire prediction
    New model introduced by the paper; no independent evidence outside this work is mentioned.

pith-pipeline@v0.9.0 · 5747 in / 1273 out tokens · 101763 ms · 2026-05-20T20:40:59.687853+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 2 internal anchors

  1. [1]

    Pangu- weather: A 3d high-resolution model for fast and accurate global weather forecast.Nature, 619(7970):533–538, 2023

    Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. Pangu- weather: A 3d high-resolution model for fast and accurate global weather forecast.Nature, 619(7970):533–538, 2023

  2. [2]

    Bruinsma, Ana Lucic, Megan Stanley, Anna Vaughan, Johannes Brandstetter, Patrick Riechert, Jonathan A

    Cristian Bodnar et al. Aurora: A foundation model of the atmosphere.arXiv preprint arXiv:2405.13063, 2024

  3. [3]

    AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data

    Christopher F Brown, Michal R Kazmierski, Valerie J Pasquarella, William J Rucklidge, Masha Samsikova, Chenhui Zhang, Evan Shelhamer, Estefania Lahera, Olivia Wiles, Simon Ilyushchenko, et al. Alphaearth foundations: An embedding field model for accurate and efficient global mapping from sparse label data.arXiv preprint arXiv:2507.22291, 2025

  4. [4]

    Fengwu: Pushing the skillful global medium-range weather forecast beyond 10 days lead.arXiv preprint arXiv:2304.02948, 2023

    Kang Chen, Tao Han, Junchao Gong, Lei Bai, Fenghua Ling, Jing-Jia Luo, Xi Chen, Leiming Ma, Tianning Zhang, Rui Su, et al. Fengwu: Pushing the skillful global medium-range weather forecast beyond 10 days lead.arXiv preprint arXiv:2304.02948, 2023

  5. [5]

    Fuxi: A cascade machine learning forecasting system for 15-day global weather forecast.npj climate and atmospheric science, 6(1):190, 2023

    Lei Chen, Xiaohui Zhong, Feng Zhang, Yuan Cheng, Yinghui Xu, Yuan Qi, and Hao Li. Fuxi: A cascade machine learning forecasting system for 15-day global weather forecast.npj climate and atmospheric science, 6(1):190, 2023

  6. [6]

    Elizabeth E. Ebert. Neighborhood verification: A strategy for rewarding close forecasts.Weather and Forecasting, 24(6):1498–1510, 2009

  7. [7]

    Introducing spatially distributed fire danger from earth observations (fdeo) using satellite-based data in the contiguous united states.Remote Sensing, 12(8):1252, 2020

    Alireza Farahmand, E Natasha Stavros, John T Reager, and Ali Behrangi. Introducing spatially distributed fire danger from earth observations (fdeo) using satellite-based data in the contiguous united states.Remote Sensing, 12(8):1252, 2020

  8. [8]

    Wildfirespreadts: A dataset of multi-modal time series for wildfire spread prediction.Advances in Neural Information Processing Systems, 36:74515–74529, 2023

    Sebastian Gerard, Yu Zhao, and Josephine Sullivan. Wildfirespreadts: A dataset of multi-modal time series for wildfire spread prediction.Advances in Neural Information Processing Systems, 36:74515–74529, 2023

  9. [9]

    Intercomparison of spatial forecast verification methods.Weather and Forecasting, 24(5):1416–1430, 2009

    Eric Gilleland, David Ahijevych, Barbara G Brown, and Elizabeth E Ebert. Intercomparison of spatial forecast verification methods.Weather and Forecasting, 24(5):1416–1430, 2009

  10. [10]

    Early warning systems for the prediction of and appropriate response to wildfires and related environmental hazards

    Johann Georg Goldammer. Early warning systems for the prediction of and appropriate response to wildfires and related environmental hazards. InEarly Warning Systems for Natural Disaster Reduction, 1999

  11. [11]

    Next day wildfire prediction using deep learning.arXiv preprint arXiv:2206.08930, 2022

    Fantine Huot, R Lily Hu, Nita Goyal, Tharun Sankar, Matthias Ihme, and Yi-Fan Chen. Next day wildfire prediction using deep learning.arXiv preprint arXiv:2206.08930, 2022

  12. [12]

    WILDS: A benchmark of in-the-wild distribution shifts

    Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. WILDS: A benchmark of in-the-wild distribution shifts. InProceedings of the International Conference on Machine Learning, pages 5637–5664, 2021

  13. [13]

    Giannaros, et al

    Vassiliki Kotroni, Constantinos Cartalis, Silas Michaelides, Julia Stoyanova, Filippos Tymvios, Antonis Bezes, Theodoros Christoudias, Stavros Dafis, Christos Giannakopoulos, Theodore M. Giannaros, et al. Disarm early warning system for wildfires in the eastern mediterranean. Sustainability, 12(16):6670, 2020. 10

  14. [14]

    GEO- Bench: Toward foundation models for earth monitoring

    Alexandre Lacoste, Nils Lehmann, Pau Rodriguez, Evan Sherwin, Hannah Kerner, Björn Lütjens, Jeremy Irvin, David Dao, Hamed Alemohammad, Alexandre Drouin, et al. GEO- Bench: Toward foundation models for earth monitoring. InAdvances in Neural Information Processing Systems, 2023

  15. [15]

    Learning skillful medium-range global weather forecasting.Science, 382(6677):1416–1421, 2023

    Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, et al. Learning skillful medium-range global weather forecasting.Science, 382(6677):1416–1421, 2023

  16. [16]

    LANDFIRE 40 Fire Behavior Fuel Models

    LANDFIRE. LANDFIRE 40 Fire Behavior Fuel Models. https://landfire.gov/fuel/ fbfm40, 2026. Accessed: 2026-05-05

  17. [17]

    LANDFIRE Forest Canopy Cover

    LANDFIRE. LANDFIRE Forest Canopy Cover. https://landfire.gov/fuel/cc, 2026. Accessed: 2026-05-05

  18. [18]

    Lüth, Till J

    Carsten T. Lüth, Till J. Bungert, Lukas Klein, and Paul F. Jäger. Navigating the pitfalls of active learning evaluation: A systematic framework for meaningful performance assessment. In Advances in Neural Information Processing Systems, 2023

  19. [19]

    arXiv preprint arXiv:2412.04204 , year =

    Valerio Marsocci, Yuru Jia, Georges Le Bellier, David Kerekes, Liang Zeng, Sebastian Hafner, Sebastian Gerard, Eric Brune, Ritu Yadav, Ali Shibli, et al. Pangaea: A global and inclusive benchmark for geospatial foundation models.arXiv preprint arXiv:2412.04204, 2024

  20. [20]

    McDermott, Haoran Zhang, Lasse Hyldig Hansen, Giovanni Angelotti, and Jack Gallifant

    Matthew B. McDermott, Haoran Zhang, Lasse Hyldig Hansen, Giovanni Angelotti, and Jack Gallifant. A closer look at AUROC and AUPRC under class imbalance. InAdvances in Neural Information Processing Systems, 2024

  21. [21]

    Fire Information for Resource Management System (FIRMS)

    NASA Earthdata. Fire Information for Resource Management System (FIRMS). https: //www.earthdata.nasa.gov/data/tools/firms, 2026. Accessed: 2026-05-05

  22. [22]

    Wildland Fire Interagency Geospatial Services (WFIGS): Current Perimeters

    National Interagency Fire Center. Wildland Fire Interagency Geospatial Services (WFIGS): Current Perimeters. https://data-nifc.opendata.arcgis.com/datasets/nifc:: wfigs-current-perimeters/about, 2026. Accessed: 2026-05-05

  23. [23]

    K., and Grover, A

    Tung Nguyen, Johannes Brandstetter, Aditya Kapoor, Jayesh K Gupta, and Aditya Grover. Climax: A foundation model for weather and climate.arXiv preprint arXiv:2301.10343, 2023

  24. [24]

    Rapid Refresh / High-Resolution Rapid Refresh

    NOAA National Centers for Environmental Information. Rapid Refresh / High-Resolution Rapid Refresh. https://www.ncei.noaa.gov/products/weather-climate-models/ rapid-refresh-update, 2026. Accessed: 2026-05-05

  25. [25]

    High- Resolution Rapid Refresh (HRRR)

    NOAA National Centers for Environmental Prediction Environmental Modeling Center. High- Resolution Rapid Refresh (HRRR). https://rapidrefresh.noaa.gov/hrrr/, 2026. Ac- cessed: 2026-05-05

  26. [26]

    LandScan Global 2024

    Oak Ridge National Laboratory. LandScan Global 2024. https://landscan.ornl.gov/,

  27. [28]

    Kilometer-scale convection-allowing model emulation using generative diffusion modeling.Science Advances, 12(5):eadv0423, 2026

    Jaideep Pathak, Yair Cohen, Piyush Garg, Peter Harrington, Noah Brenowitz, Dale Durran, Morteza Mardani, Arash Vahdat, Shaoming Xu, Karthik Kashinath, et al. Kilometer-scale convection-allowing model emulation using generative diffusion modeling.Science Advances, 12(5):eadv0423, 2026

  28. [29]

    FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators

    Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, et al. Fourcastnet: A global data-driven high-resolution weather model using adaptive fourier neural operators.arXiv preprint arXiv:2202.11214, 2022

  29. [30]

    An early warning system to forecast the close of the spring burning window from satellite-observed greenness.Scientific Reports, 7(1):14190, 2017

    Paul D Pickell, Nicholas C Coops, Colin J Ferster, Christopher W Bater, Karen D Blouin, Mike D Flannigan, and Jinkai Zhang. An early warning system to forecast the close of the spring burning window from satellite-observed greenness.Scientific Reports, 7(1):14190, 2017

  30. [31]

    Firecast: Leveraging deep learning to predict wildfire spread

    David Radke, Anna Hessler, and David Ellsworth. Firecast: Leveraging deep learning to predict wildfire spread. InProceedings of the 28th International Joint Conference on Artificial Intelligence, pages 4575–4581, 2019. 11

  31. [32]

    WeatherBench 2: A benchmark for the next generation of data-driven global weather models.Journal of Advances in Modeling Earth Systems, 16(6):e2023MS004019, 2024

    Stephan Rasp, Stephan Hoyer, Alex Merose, Ian Langmore, Peter Battaglia, Tyler Russell, Alvaro Sanchez-Gonzalez, Vivian Yang, Rob Carver, Shreya Agrawal, et al. WeatherBench 2: A benchmark for the next generation of data-driven global weather models.Journal of Advances in Modeling Earth Systems, 16(6):e2023MS004019, 2024

  32. [33]

    Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning.arXiv preprint arXiv:2212.14532, 2023

    Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brockman, Christopher Funk, Brian Clipp, Kurt Keutzer, Stefano Ermon, and Ruslan Salakhutdinov. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning.arXiv preprint arXiv:2212.14532, 2023

  33. [34]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InMedical Image Computing and Computer-Assisted Inter- vention, pages 234–241, 2015

  34. [35]

    Are emergent abilities of large language models a mirage? InAdvances in Neural Information Processing Systems, 2023

    Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? InAdvances in Neural Information Processing Systems, 2023

  35. [36]

    Prithvi wxc: Foun- dation model for weather and climate,

    Johannes Schmude, Sujit Roy, Paulina Trofimova, Karthik Ramesh, Bethany Lusch, Harikumar Kesa, Shraddha Singh, Phil Chen, Zhuohan Liu, Shubhankar Parashar, et al. Prithvi wxc: Foundation model for weather and climate.arXiv preprint arXiv:2409.13598, 2024

  36. [37]

    Stewart, Caleb Robinson, Isaac A

    Adam J. Stewart, Caleb Robinson, Isaac A. Corley, Anthony Ortiz, Juan M. Lavista Ferres, and Arindam Banerjee. Torchgeo: Deep learning with geospatial data. InProceedings of the ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2022

  37. [38]

    Bungert, Carsten T

    Jeremias Traub, Till J. Bungert, Carsten T. Lüth, Michael Baumgartner, Klaus H. Maier-Hein, Lena Maier-Hein, and Paul F. Jäger. Overcoming common flaws in the evaluation of selective classification systems. InAdvances in Neural Information Processing Systems, 2024

  38. [39]

    Geological Survey and USDA Forest Service

    U.S. Geological Survey and USDA Forest Service. Monitoring Trends in Burn Severity (MTBS). https://www.mtbs.gov/, 2025. Accessed: 2026-05-05

  39. [40]

    Wildfire Risk to Communities: Housing Unit Density Image Service

    USDA Forest Service. Wildfire Risk to Communities: Housing Unit Density Image Service. https://catalog.data.gov/dataset/ wildfire-risk-to-communities-housing-unit-density-image-service-fac22 ,

  40. [41]

    Accessed: 2026-05-05

  41. [42]

    Improving data-driven global weather prediction using deep convolutional neural networks on a cubed sphere.Journal of Advances in Modeling Earth Systems, 12(9):e2020MS002109, 2020

    Jonathan A Weyn, Dale R Durran, and Rich Caruana. Improving data-driven global weather prediction using deep convolutional neural networks on a cubed sphere.Journal of Advances in Modeling Earth Systems, 12(9):e2020MS002109, 2020

  42. [43]

    Lobell, and Stefano Ermon

    Christopher Yeh, Chenlin Meng, Sijing Wang, Anne Driscoll, Erik Rozi, Peng Liu, Jae Yong Lee, Marshall Burke, David B. Lobell, and Stefano Ermon. SustainBench: Benchmarks for monitoring the sustainable development goals with machine learning. InAdvances in Neural Information Processing Systems, 2021. 12 Appendix Contents A Evaluation Contract Specificatio...