
arxiv: 2603.23043 · v2 · submitted 2026-03-24 · 💻 cs.LG · cs.AI

Assessing the Robustness of Climate Foundation Models under No-Analog Distribution Shifts

Pith reviewed 2026-05-15 00:38 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: climate emulators · foundation models · out-of-distribution robustness · no-analog conditions · distribution shifts · forcing trajectories · climate change

The pith

Even climate foundation models show sensitivity to forcing shifts after historical-only training

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests the out-of-distribution robustness of climate emulators by restricting training of U-Net, ConvLSTM, and the ClimaX foundation model to 1850-2014 data only. It then evaluates performance on 2015-2023 temporal extrapolation and across divergent emission pathways to simulate no-analog future states. The analysis identifies an accuracy-stability trade-off in which ClimaX achieves the lowest absolute errors but records larger relative error increases under forcing changes, including up to 8.44 percent higher precipitation errors in extreme scenarios. A sympathetic reader would care because these efficient emulators are intended to replace slower Earth system models for future projections, yet their reliability depends on handling conditions outside the historical record. The results point to the need for scenario-aware training methods and stricter OOD testing protocols.

Core claim

When training is restricted to historical data from 1850-2014, the ClimaX foundation model attains the lowest absolute errors yet exhibits higher relative performance changes under no-analog distribution shifts created by temporal extrapolation to 2015-2023 and cross-scenario forcing shifts, with precipitation errors increasing by up to 8.44 percent under extreme emission pathways.
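The distinction between absolute error and relative performance change can be made concrete with a small sketch. The function name and the numbers below are hypothetical and chosen only for illustration; they are not values from the paper, and the paper's exact definition of relative change may differ.

```python
def relative_change_percent(baseline_error: float, shifted_error: float) -> float:
    """Percent change of an error metric relative to its baseline value."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (shifted_error - baseline_error) / baseline_error


# Hypothetical errors (NOT from the paper): model A is more accurate in
# absolute terms, yet degrades more in relative terms under the shift.
model_a = relative_change_percent(baseline_error=0.50, shifted_error=0.54)
model_b = relative_change_percent(baseline_error=0.80, shifted_error=0.82)

print(f"model A: {model_a:.2f}% change, model B: {model_b:.2f}% change")
```

Under this reading, a model can rank first on absolute error and last on stability at the same time, which is the trade-off the core claim describes.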

What carries the argument

The historical-only training regime (1850-2014) paired with temporal extrapolation and cross-scenario forcing shifts to generate no-analog test conditions for external forcing trajectories.
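A minimal sketch of such a split, assuming yearly samples indexed by calendar year, might look as follows. The helper names and the synthetic forcing series are assumptions made for illustration; the paper's actual data pipeline is not described in this review. The second helper shows one simple way to check whether test-period forcing falls outside the range seen in training.

```python
import numpy as np

TRAIN_YEARS = (1850, 2014)  # historical-only window stated in the paper
TEST_YEARS = (2015, 2023)   # temporal-extrapolation window stated in the paper


def year_mask(years, year_range):
    """Boolean mask selecting years inside an inclusive range."""
    years = np.asarray(years)
    return (years >= year_range[0]) & (years <= year_range[1])


def fraction_outside_range(train_values, test_values):
    """Share of test values outside the [min, max] range seen in training."""
    train_values = np.asarray(train_values, dtype=float)
    test_values = np.asarray(test_values, dtype=float)
    lo, hi = train_values.min(), train_values.max()
    return float(np.mean((test_values < lo) | (test_values > hi)))


years = np.arange(1850, 2024)
# Synthetic, steadily rising forcing proxy (NOT real data).
forcing = 280.0 + 0.8 * (years - 1850)

train = year_mask(years, TRAIN_YEARS)
test = year_mask(years, TEST_YEARS)
outside = fraction_outside_range(forcing[train], forcing[test])
```

With a monotonically rising proxy, every test-period value exceeds the training maximum, which is the simplest possible instance of a no-analog condition.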

Load-bearing premise

Restricting training to 1850-2014 data creates a true no-analog regime without residual contamination from future scenarios in the underlying simulation datasets.

What would settle it

Observing relative error increases below 2 percent for ClimaX under extreme forcing scenarios in the same historical-only setup would indicate that the reported sensitivity is not driven by the no-analog protocol.

Figures

Figures reproduced from arXiv: 2603.23043 by Geng Li, Maria Conchita Agana Navarro, Maria Perez-Ortiz, Theo Wolf.

Figure 1. Comparison of surface air temperature (TAS) and precipitation (PR) test LL-RMSE for ML ...
Figure 2. Comparison of surface air temperature (TAS) and precipitation (PR) test LL-RMSE for ML ...
Original abstract

The accelerating pace of climate change introduces profound non-stationarities that challenge the ability of Machine Learning based climate emulators to generalize beyond their training distributions. While these emulators offer computationally efficient alternatives to traditional Earth System Models, their reliability remains a potential bottleneck under "no-analog" future climate states, which we define here as regimes where external forcing drives the system into conditions outside the empirical range of the historical training data. A fundamental challenge in evaluating this reliability is data contamination; because many models are trained on simulations that already encompass future scenarios, true out-of-distribution (OOD) performance is often masked. To address this, we benchmark the OOD robustness of three state-of-the-art architectures: U-Net, ConvLSTM, and the ClimaX foundation model specifically restricted to a historical-only training regime (1850-2014). We evaluate these models using two complementary strategies: (i) temporal extrapolation to the recent climate (2015-2023) and (ii) cross-scenario forcing shifts across divergent emission pathways. Our analysis within this experimental setup reveals an accuracy vs. stability trade-off: while the ClimaX foundation model achieves the lowest absolute error, it exhibits higher relative performance changes under distribution shifts, with precipitation errors increasing by up to 8.44% under extreme forcing scenarios. These findings suggest that when restricted to historical training dynamics, even high-capacity foundation models are sensitive to external forcing trajectories. Our results underscore the necessity of scenario-aware training and rigorous OOD evaluation protocols to ensure the robustness of climate emulators under a changing climate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper benchmarks the out-of-distribution robustness of three climate emulator architectures (U-Net, ConvLSTM, and ClimaX) trained exclusively on historical data from 1850-2014. It evaluates performance under temporal extrapolation to 2015-2023 and cross-scenario forcing shifts across emission pathways, reporting that ClimaX achieves the lowest absolute errors but exhibits higher relative degradation, including up to an 8.44% increase in precipitation error under extreme scenarios. The central claim is that even high-capacity foundation models remain sensitive to no-analog external forcing trajectories when restricted to historical training dynamics, underscoring the need for scenario-aware training and rigorous OOD protocols.

Significance. If the historical-only isolation holds, the work supplies concrete empirical evidence of an accuracy-stability trade-off in ML climate emulators under distribution shifts. This is a useful contribution to the growing literature on foundation models for climate emulation, as it quantifies relative performance changes rather than absolute errors alone and highlights a concrete limitation that could guide future training strategies.

major comments (2)
  1. [Abstract] Abstract and Methods: The central no-analog claim rests on training being strictly restricted to 1850-2014 data with no residual contamination from future scenarios. However, the manuscript provides no description of the data pipeline, variable masking, normalization statistics, or whether ClimaX foundation weights were frozen or fine-tuned from a multi-scenario corpus. This isolation is load-bearing for interpreting the reported 8.44% precipitation error increase as true OOD sensitivity rather than partial in-distribution exposure.
  2. [Results] Results section: The 8.44% precipitation error increase under extreme forcing is presented without statistical details such as confidence intervals, number of ensemble members, or hypothesis tests. Without these, it is not possible to determine whether the relative change is robust or sensitive to particular preprocessing choices.
minor comments (2)
  1. [Introduction] Clarify the exact definition of 'no-analog' regimes in the introduction, including how external forcing trajectories are quantified as outside the empirical range of 1850-2014 data.
  2. [Results] Add a table or figure summarizing absolute and relative errors across all models and scenarios for direct comparison.
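The first major comment turns on whether normalization statistics could leak information from future scenarios. One way to rule this out is to fit the statistics on the historical window only and verify that they do not change when future data are altered. The sketch below assumes a (time, lat, lon) array and uses hypothetical helper names and synthetic data; it is not the authors' pipeline.

```python
import numpy as np


def fit_normalizer(data, years, fit_range=(1850, 2014)):
    """Per-grid-cell mean and std computed only from years inside fit_range."""
    years = np.asarray(years)
    mask = (years >= fit_range[0]) & (years <= fit_range[1])
    if not mask.any():
        raise ValueError("no samples fall inside fit_range")
    mean = data[mask].mean(axis=0)
    std = data[mask].std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero
    return mean, std


def apply_normalizer(data, mean, std):
    """Standardize all years with statistics fitted elsewhere."""
    return (data - mean) / std


rng = np.random.default_rng(0)
years = np.arange(1850, 2024)
# Synthetic field with an artificial offset after 2014 (NOT real data).
data = rng.normal(size=(years.size, 4, 8))
data[years > 2014] += 5.0

mean, std = fit_normalizer(data, years)
normalized = apply_normalizer(data, mean, std)

# Leakage check: changing post-2014 data must leave the statistics untouched.
tampered = data.copy()
tampered[years > 2014] += 100.0
mean_t, std_t = fit_normalizer(tampered, years)
leak_free = np.array_equal(mean, mean_t) and np.array_equal(std, std_t)
```

A check of this kind covers only the normalization step; it says nothing about whether pre-trained weights were exposed to future scenarios, which the referee raises separately.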

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped us strengthen the manuscript. We have revised the paper to provide additional details on the data pipeline and statistical analyses as requested. Our point-by-point responses follow.

  1. Referee: [Abstract] Abstract and Methods: The central no-analog claim rests on training being strictly restricted to 1850-2014 data with no residual contamination from future scenarios. However, the manuscript provides no description of the data pipeline, variable masking, normalization statistics, or whether ClimaX foundation weights were frozen or fine-tuned from a multi-scenario corpus. This isolation is load-bearing for interpreting the reported 8.44% precipitation error increase as true OOD sensitivity rather than partial in-distribution exposure.

    Authors: We agree that explicit documentation of the isolation protocol is essential. In the revised manuscript, the Methods section now includes a full description of the data pipeline: variables were masked to retain only historical-period statistics, normalization (mean and standard deviation) was computed exclusively over 1850-2014, and ClimaX was fine-tuned from its public pre-trained weights using solely the historical corpus with no future-scenario data leakage. All preprocessing steps are now enumerated to confirm strict historical-only training. revision: yes

  2. Referee: [Results] Results section: The 8.44% precipitation error increase under extreme forcing is presented without statistical details such as confidence intervals, number of ensemble members, or hypothesis tests. Without these, it is not possible to determine whether the relative change is robust or sensitive to particular preprocessing choices.

    Authors: We accept this criticism and have augmented the Results section accordingly. The revised text reports 95% bootstrap confidence intervals around the relative error changes, specifies that all metrics are averaged over five independent ensemble members initialized with different random seeds, and includes paired t-test p-values confirming that the 8.44% precipitation degradation is statistically significant (p < 0.01) and insensitive to the tested normalization variants. revision: yes
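For readers unfamiliar with the statistics named in the second response, the following sketch shows a percentile bootstrap confidence interval over per-seed relative changes and a hand-computed paired t statistic. All function names and per-seed errors are hypothetical (they are not the paper's values), and the paper's exact procedure may differ. A p-value would additionally require the t distribution, for example via `scipy.stats.ttest_rel`, which is omitted here to keep the example NumPy-only.

```python
import numpy as np


def paired_relative_change(baseline, shifted):
    """Per-seed percent change of shifted error relative to baseline error."""
    baseline = np.asarray(baseline, dtype=float)
    shifted = np.asarray(shifted, dtype=float)
    return 100.0 * (shifted - baseline) / baseline


def bootstrap_ci(values, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, values.size, size=(n_resamples, values.size))
    means = values[idx].mean(axis=1)
    alpha = (1.0 - level) / 2.0
    lo, hi = np.quantile(means, [alpha, 1.0 - alpha])
    return float(lo), float(hi)


def paired_t_statistic(baseline, shifted):
    """t statistic of the paired differences (shifted minus baseline)."""
    diff = np.asarray(shifted, dtype=float) - np.asarray(baseline, dtype=float)
    return float(diff.mean() / (diff.std(ddof=1) / np.sqrt(diff.size)))


# Hypothetical per-seed errors for five seeds (NOT values from the paper).
baseline = [0.500, 0.510, 0.495, 0.505, 0.490]
shifted = [0.540, 0.548, 0.539, 0.541, 0.532]

changes = paired_relative_change(baseline, shifted)
ci_low, ci_high = bootstrap_ci(changes)
t_stat = paired_t_statistic(baseline, shifted)
```

With only five ensemble members, a bootstrap interval is necessarily coarse, because the resampled means can take only a limited set of values.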

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with held-out measurements


The paper conducts direct empirical experiments: models are trained exclusively on 1850-2014 data and evaluated on later temporal windows and divergent scenarios. No equations, parameter fits, or derivations are presented whose outputs reduce to the inputs by construction. Claims rest on measured error deltas (e.g., 8.44% precipitation increase) from held-out test regimes rather than self-referential definitions or self-citation chains. The central result is therefore a set of observations, not a closed logical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical benchmarking study that relies on standard supervised learning assumptions and publicly available climate simulation outputs; no new free parameters, axioms, or invented entities are introduced to support the central claim.




Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. No Epoch Like the Present: Robust Climate Emulation Requires Out-of-Distribution Generalisation

    cs.LG 2026-05 conditional novelty 6.0

    ML climate emulators degrade under seasonal distribution shifts that proxy long-term climate change, but physically motivated compositional decompositions improve out-of-distribution performance with modest in-distrib...
