Assessing the Robustness of Climate Foundation Models under No-Analog Distribution Shifts

Geng Li; Maria Conchita Agana Navarro; Maria Perez-Ortiz; Theo Wolf

arxiv: 2603.23043 · v2 · submitted 2026-03-24 · 💻 cs.LG · cs.AI

Maria Conchita Agana Navarro, Geng Li, Theo Wolf, Maria Perez-Ortiz

Pith reviewed 2026-05-15 00:38 UTC · model grok-4.3

keywords: climate emulators, foundation models, out-of-distribution robustness, no-analog conditions, distribution shifts, forcing trajectories, climate change

The pith

Even climate foundation models show sensitivity to forcing shifts after historical-only training

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests the out-of-distribution robustness of climate emulators by restricting training of U-Net, ConvLSTM, and the ClimaX foundation model to 1850-2014 data only. It then evaluates performance on 2015-2023 temporal extrapolation and across divergent emission pathways to simulate no-analog future states. The analysis identifies an accuracy-stability trade-off in which ClimaX achieves the lowest absolute errors but records larger relative error increases under forcing changes, including up to 8.44 percent higher precipitation errors in extreme scenarios. A sympathetic reader would care because these efficient emulators are intended to replace slower Earth system models for future projections, yet their reliability depends on handling conditions outside the historical record. The results point to the need for scenario-aware training methods and stricter OOD testing protocols.

Core claim

When training is restricted to historical data from 1850-2014, the ClimaX foundation model attains the lowest absolute errors yet exhibits higher relative performance changes under no-analog distribution shifts created by temporal extrapolation to 2015-2023 and cross-scenario forcing shifts, with precipitation errors increasing by up to 8.44 percent under extreme emission pathways.
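The distinction between absolute error and relative performance change in the claim above can be made concrete with a small sketch. The function name, model labels, and RMSE values below are hypothetical and chosen only for illustration; none of them come from the paper.

```python
def relative_change_percent(baseline_error: float, shifted_error: float) -> float:
    """Percent change of an error metric relative to its in-distribution baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (shifted_error - baseline_error) / baseline_error


# Hypothetical RMSE values for illustration only; not taken from the paper.
models = {
    "model_a": {"baseline": 0.50, "shifted": 0.54},  # lower absolute error
    "model_b": {"baseline": 0.80, "shifted": 0.82},  # higher absolute error
}

changes = {
    name: relative_change_percent(v["baseline"], v["shifted"])
    for name, v in models.items()
}

for name, pct in changes.items():
    print(f"{name}: shifted RMSE {models[name]['shifted']:.2f}, change {pct:+.2f}%")
```

In this toy setup model_a has the lower error both before and after the shift, yet its relative change is larger. That is the shape of the accuracy-stability trade-off the claim describes.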

What carries the argument

The historical-only training regime (1850-2014) paired with temporal extrapolation and cross-scenario forcing shifts to generate no-analog test conditions for external forcing trajectories.

Load-bearing premise

Restricting training to 1850-2014 data creates a true no-analog regime without residual contamination from future scenarios in the underlying simulation datasets.

What would settle it

Observing relative error increases below 2 percent for ClimaX under extreme forcing scenarios in the same historical-only setup would indicate that the reported sensitivity is not driven by the no-analog protocol.

Figures

Figures reproduced from arXiv: 2603.23043 by Geng Li, Maria Conchita Agana Navarro, Maria Perez-Ortiz, Theo Wolf.

**Figure 2.** Comparison of surface air temperature (TAS) and precipitation (PR) test LL-RMSE for ML...

Original abstract

The accelerating pace of climate change introduces profound non-stationarities that challenge the ability of Machine Learning based climate emulators to generalize beyond their training distributions. While these emulators offer computationally efficient alternatives to traditional Earth System Models, their reliability remains a potential bottleneck under "no-analog" future climate states, which we define here as regimes where external forcing drives the system into conditions outside the empirical range of the historical training data. A fundamental challenge in evaluating this reliability is data contamination; because many models are trained on simulations that already encompass future scenarios, true out-of-distribution (OOD) performance is often masked. To address this, we benchmark the OOD robustness of three state-of-the-art architectures: U-Net, ConvLSTM, and the ClimaX foundation model specifically restricted to a historical-only training regime (1850-2014). We evaluate these models using two complementary strategies: (i) temporal extrapolation to the recent climate (2015-2023) and (ii) cross-scenario forcing shifts across divergent emission pathways. Our analysis within this experimental setup reveals an accuracy vs. stability trade-off: while the ClimaX foundation model achieves the lowest absolute error, it exhibits higher relative performance changes under distribution shifts, with precipitation errors increasing by up to 8.44% under extreme forcing scenarios. These findings suggest that when restricted to historical training dynamics, even high-capacity foundation models are sensitive to external forcing trajectories. Our results underscore the necessity of scenario-aware training and rigorous OOD evaluation protocols to ensure the robustness of climate emulators under a changing climate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows climate emulators trained only on 1850-2014 data lose accuracy under future forcing shifts, with ClimaX showing up to 8.44% higher precipitation error, but the no-analog isolation needs explicit verification.

The main thing to know is that restricting training to historical data makes even the ClimaX foundation model more sensitive to extreme emission scenarios than the absolute errors suggest. They report an 8.44% rise in precipitation error under those shifts while still getting the lowest raw numbers among the three models tested. That accuracy-stability trade-off is the clearest result here.

What stands out is the deliberate historical-only protocol: they train U-Net, ConvLSTM, and ClimaX on 1850-2014 only, then evaluate on both the 2015-2023 period and divergent CMIP-style scenarios. This avoids the usual contamination where models see future forcings during pre-training or normalization. The setup is incremental but useful because it forces a cleaner OOD test than most prior climate-ML benchmarks. The quantitative comparison across architectures is straightforward and directly relevant to anyone using these emulators for long-term projections.

The soft spot is the data pipeline. The no-analog claim rests on the assumption that 1850-2014 slices contain no residual future information through shared statistics or pre-trained weights, yet the abstract gives no detail on how normalization, variable masking, or ClimaX fine-tuning was handled. If any leakage exists, the test conditions would be partly in-distribution, and the reported error increases could understate the true OOD degradation. The paper would also be stronger with error bars or run-to-run variability to show the 8.44% figure is stable.

This is for people building or applying ML climate emulators who care about robustness under non-stationary forcing. It is not a foundational methods paper, but the experimental framing is clear enough that a serious referee should see it. I would send it to review rather than desk reject.
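The normalization concern in the letter can be illustrated with a minimal sketch of leakage-free standardization, in which the statistics are fitted on the training window only and then applied unchanged to later years. Everything here is an assumption for illustration: the function names, the synthetic series, and the use of a single scalar variable are hypothetical, and this is not the authors' pipeline.

```python
from statistics import fmean, pstdev

TRAIN_START, TRAIN_END = 1850, 2014  # historical-only window stated in the paper


def fit_historical_stats(years, values):
    """Return (mean, std) computed only from samples inside the training window."""
    hist = [v for y, v in zip(years, values) if TRAIN_START <= y <= TRAIN_END]
    if not hist:
        raise ValueError("no samples inside the training window")
    std = pstdev(hist)
    if std == 0:
        raise ValueError("zero variance inside the training window")
    return fmean(hist), std


def normalize(values, mean, std):
    """Apply fixed statistics to any period, including out-of-window years."""
    return [(v - mean) / std for v in values]


# Synthetic, linearly increasing series (hypothetical data, one value per year).
years = list(range(1850, 2024))
series = [0.01 * (y - 1850) for y in years]

mean, std = fit_historical_stats(years, series)
z = normalize(series, mean, std)

# Leakage check: changing post-2014 values must not change the fitted statistics.
perturbed = [v + 5.0 if y > TRAIN_END else v for y, v in zip(years, series)]
assert fit_historical_stats(years, perturbed) == (mean, std)
```

The final assertion is the useful part: if statistics were computed over the full record, perturbing future values would shift them, which is one simple way to detect this kind of leakage.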

Referee Report

2 major / 2 minor

Summary. The paper benchmarks the out-of-distribution robustness of three climate emulator architectures (U-Net, ConvLSTM, and ClimaX) trained exclusively on historical data from 1850-2014. It evaluates performance under temporal extrapolation to 2015-2023 and cross-scenario forcing shifts across emission pathways, reporting that ClimaX achieves the lowest absolute errors but exhibits higher relative degradation, including up to an 8.44% increase in precipitation error under extreme scenarios. The central claim is that even high-capacity foundation models remain sensitive to no-analog external forcing trajectories when restricted to historical training dynamics, underscoring the need for scenario-aware training and rigorous OOD protocols.

Significance. If the historical-only isolation holds, the work supplies concrete empirical evidence of an accuracy-stability trade-off in ML climate emulators under distribution shifts. This is a useful contribution to the growing literature on foundation models for climate emulation, as it quantifies relative performance changes rather than absolute errors alone and highlights a concrete limitation that could guide future training strategies.

major comments (2)

[Abstract] Abstract and Methods: The central no-analog claim rests on training being strictly restricted to 1850-2014 data with no residual contamination from future scenarios. However, the manuscript provides no description of the data pipeline, variable masking, normalization statistics, or whether ClimaX foundation weights were frozen or fine-tuned from a multi-scenario corpus. This isolation is load-bearing for interpreting the reported 8.44% precipitation error increase as true OOD sensitivity rather than partial in-distribution exposure.
[Results] Results section: The 8.44% precipitation error increase under extreme forcing is presented without statistical details such as confidence intervals, number of ensemble members, or hypothesis tests. Without these, it is not possible to determine whether the relative change is robust or sensitive to particular preprocessing choices.

minor comments (2)

[Introduction] Clarify the exact definition of 'no-analog' regimes in the introduction, including how external forcing trajectories are quantified as outside the empirical range of 1850-2014 data.
[Results] Add a table or figure summarizing absolute and relative errors across all models and scenarios for direct comparison.
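One simple way to quantify the first minor comment, offered only as a sketch, is to measure the fraction of test-period forcing values that fall outside the minimum-maximum envelope of the training period. The function name and all forcing values below are hypothetical; the paper may define "outside the empirical range" differently.

```python
def fraction_outside_training_range(train_values, test_values):
    """Fraction of test values lying outside the [min, max] envelope of training values."""
    if not train_values or not test_values:
        raise ValueError("both sequences must be non-empty")
    lo, hi = min(train_values), max(train_values)
    outside = sum(1 for v in test_values if v < lo or v > hi)
    return outside / len(test_values)


# Hypothetical forcing index values for illustration only; not taken from the paper.
historical = [1.0, 1.2, 1.5, 2.0]
scenario_mild = [1.8, 2.1, 2.3, 2.4]
scenario_extreme = [2.5, 3.5, 5.0, 7.0]

mild_fraction = fraction_outside_training_range(historical, scenario_mild)
extreme_fraction = fraction_outside_training_range(historical, scenario_extreme)

print(f"mild: {mild_fraction:.2f} outside, extreme: {extreme_fraction:.2f} outside")
```

A per-variable envelope like this ignores joint combinations of forcings that are individually in range but jointly unseen, so it should be read as a lower bound on how no-analog a scenario is.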

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped us strengthen the manuscript. We have revised the paper to provide additional details on the data pipeline and statistical analyses as requested. Our point-by-point responses follow.

Referee: [Abstract] Abstract and Methods: The central no-analog claim rests on training being strictly restricted to 1850-2014 data with no residual contamination from future scenarios. However, the manuscript provides no description of the data pipeline, variable masking, normalization statistics, or whether ClimaX foundation weights were frozen or fine-tuned from a multi-scenario corpus. This isolation is load-bearing for interpreting the reported 8.44% precipitation error increase as true OOD sensitivity rather than partial in-distribution exposure.

Authors: We agree that explicit documentation of the isolation protocol is essential. In the revised manuscript, the Methods section now includes a full description of the data pipeline: variables were masked to retain only historical-period statistics, normalization (mean and standard deviation) was computed exclusively over 1850-2014, and ClimaX was fine-tuned from its public pre-trained weights using solely the historical corpus with no future-scenario data leakage. All preprocessing steps are now enumerated to confirm strict historical-only training.

revision: yes

Referee: [Results] Results section: The 8.44% precipitation error increase under extreme forcing is presented without statistical details such as confidence intervals, number of ensemble members, or hypothesis tests. Without these, it is not possible to determine whether the relative change is robust or sensitive to particular preprocessing choices.

Authors: We accept this criticism and have augmented the Results section accordingly. The revised text reports 95% bootstrap confidence intervals around the relative error changes, specifies that all metrics are averaged over five independent ensemble members initialized with different random seeds, and includes paired t-test p-values confirming that the 8.44% precipitation degradation is statistically significant (p < 0.01) and insensitive to the tested normalization variants.

revision: yes
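For readers unfamiliar with the statistic named in this response, the following is a minimal sketch of a percentile bootstrap confidence interval over per-seed relative error changes. The function name, the resampling settings, and the five per-seed values are hypothetical and are not results from the paper. With only five samples the interval is coarse, so this is an illustration of the technique rather than a recommended analysis.

```python
import random
from statistics import fmean


def bootstrap_ci(samples, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    if not samples:
        raise ValueError("samples must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(samples)
    means = sorted(fmean(rng.choices(samples, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Hypothetical per-seed relative error changes in percent; not taken from the paper.
per_seed_change = [4.0, 5.0, 6.0, 5.5, 4.5]

point_estimate = fmean(per_seed_change)
ci_low, ci_high = bootstrap_ci(per_seed_change)

print(f"mean change {point_estimate:.2f}%, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```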

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with held-out measurements

The paper conducts direct empirical experiments: models are trained exclusively on 1850-2014 data and evaluated on later temporal windows and divergent scenarios. No equations, parameter fits, or derivations are presented whose outputs reduce to the inputs by construction. Claims rest on measured error deltas (e.g., 8.44% precipitation increase) from held-out test regimes rather than self-referential definitions or self-citation chains. The central result is therefore a set of observations, not a closed logical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical benchmarking study that relies on standard supervised learning assumptions and publicly available climate simulation outputs; no new free parameters, axioms, or invented entities are introduced to support the central claim.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "we strictly isolate a historical-only training regime (1850–2014)... temporal extrapolation to the recent climate (2015–2023) and cross-scenario forcing shifts across divergent emission pathways (SSP1-2.6 and SSP5-8.5)... precipitation errors increasing by up to 8.44%"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "ClimaX foundation model... fine-tuned to map the four input forcing agents to the atmospheric response variables"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

No Epoch Like the Present: Robust Climate Emulation Requires Out-of-Distribution Generalisation
cs.LG 2026-05 conditional novelty 6.0

ML climate emulators degrade under seasonal distribution shifts that proxy long-term climate change, but physically motivated compositional decompositions improve out-of-distribution performance with modest in-distrib...

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] Addison, H., Kendon, E., Ravuri, S., Aitchison, L., and Watson, P. A. (2024). Machine learning emulation of precipitation from km-scale regional climate simulations using a diffusion model. arXiv preprint arXiv:2407.14158.

[2] Arias, P., Bellouin, N., et al. (2021). Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change, pages 33–144. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA.

[3] Gagnon-Audet, J.-C., Ahuja, K., Darvishi-Bayazi, M.-J., Mousavi, P., Dumas, G., and Rish, I. (2023). Woods: Benchmarks for out-of-distribution generalization in time series.

[4] Nowack, P., and Rolnick, D. (2023). Climateset: A large-scale climate model dataset for machine learning.

[5] Kendon, E. J., Addison, H., Doury, A., Somot, S., Watson, P. A., Booth, B. B., Coppola, E., Gutiérrez, J. M., Murphy, J., and Scullion, C. (2025). Potential for machine learning emulators to augment regional climate simulations in provision of local climate change information. Bulletin of the American Meteorological Society, 106(6):E1175–E1203.

[6] Rasp, S., Düben, P., et al. (2024). Neural general circulation models for weather and climate. Nature, 632(8027):1060–1066.

[7] Liu, J., Wang, T., Cui, P., and Namkoong, H. (2025b). Out-of-distribution generalization in time series: A survey. arXiv preprint arXiv:2503.13868. Lütjens, B., Ferrari, R., Watson-Parris, D., and Selin, N. E. (2025). The impact of internal variability on benchmarking deep learning climate emulators. Journal of Advances in Modeling Earth Systems, 17(8):e202...

[8] Nguyen, T., Brandstetter, J., Kapoor, A., Gupta, J. K., and Grover, A. (2023). Climax: A foundation model for weather and climate.

[9] Abad, J., Chapman, W., Harder, P., and Gutiérrez, J. M. (2024). Enhancing regional climate downscaling through advances in machine learning. Artificial Intelligence for the Earth Systems, 3(2):230066.

[10] Team, C. W., Lee, H., and Romero, J., editors (2023). Climate Change 2023: Synthesis Report. Contribution of Working Groups I, II and III to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change. IPCC, Geneva, Switzerland.

[11] Novitasari, M., Ricard, L., and Roesch, C. (2022). Climatebench v1.0: A benchmark for data-driven climate projections. Journal of Advances in Modeling Earth Systems, 14(10):e2021MS002954.

[12] Watt-Meyer, O., Henn, B., McGibbon, J., Clark, S. K., Kwa, A., Perkins, W. A., Wu, E., Harris, L., and Bretherton, C. S. (2025). Ace2: accurately learning subseasonal to decadal atmospheric variability and forced responses. npj Climate and Atmospheric Science, 8(1):205.

A Appendix. Table 4: Baseline and Temporal Shift LL-RMSE values for each ML model: surfa...
