Assessing the Robustness of Climate Foundation Models under No-Analog Distribution Shifts
Pith reviewed 2026-05-15 00:38 UTC · model grok-4.3
The pith
Even climate foundation models show sensitivity to forcing shifts after historical-only training
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When training is restricted to historical data from 1850-2014, the ClimaX foundation model attains the lowest absolute errors yet exhibits higher relative performance changes under no-analog distribution shifts created by temporal extrapolation to 2015-2023 and cross-scenario forcing shifts, with precipitation errors increasing by up to 8.44 percent under extreme emission pathways.
What carries the argument
The historical-only training regime (1850-2014) paired with temporal extrapolation and cross-scenario forcing shifts to generate no-analog test conditions for external forcing trajectories.
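The split described above can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the `(year, payload)` sample layout and the function name `split_by_year` are assumptions made for the example, and only the year boundaries come from the text.

```python
# Year windows taken from the review text; everything else is hypothetical.
TRAIN_YEARS = range(1850, 2015)          # historical-only training window, 1850-2014 inclusive
TEMPORAL_TEST_YEARS = range(2015, 2024)  # temporal-extrapolation window, 2015-2023 inclusive


def split_by_year(samples):
    """Split (year, payload) pairs into training and temporal-test lists."""
    train = [s for s in samples if s[0] in TRAIN_YEARS]
    temporal_test = [s for s in samples if s[0] in TEMPORAL_TEST_YEARS]
    return train, temporal_test


# Toy stand-in for yearly fields; the payload strings are placeholders.
samples = [(year, f"field_{year}") for year in range(1850, 2024)]
train, temporal_test = split_by_year(samples)

# No year may appear on both sides, otherwise the test set is contaminated.
overlap = {y for y, _ in train} & {y for y, _ in temporal_test}
```

Cross-scenario tests would need a second axis (the emission pathway), which this sketch leaves out.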
Load-bearing premise
Restricting training to 1850-2014 data creates a true no-analog regime without residual contamination from future scenarios in the underlying simulation datasets.
What would settle it
Observing relative error increases below 2 percent for ClimaX under extreme forcing scenarios in the same historical-only setup would indicate that the reported sensitivity is not driven by the no-analog protocol.
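The relative error increase referred to here is presumably the percentage change of a shifted-condition error against its in-distribution baseline. A minimal sketch under that assumption follows; the RMSE values are hypothetical placeholders, not results from the paper, and only the 2 percent threshold comes from the text.

```python
SETTLING_THRESHOLD_PCT = 2.0  # threshold named in "What would settle it"


def relative_change_pct(baseline_error, shifted_error):
    """Percentage change of the shifted error relative to the baseline error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (shifted_error - baseline_error) / baseline_error


# Hypothetical error values chosen only to exercise the formula.
baseline_rmse = 0.50
shifted_rmse = 0.54
change = relative_change_pct(baseline_rmse, shifted_rmse)

# True would mean the observed sensitivity falls under the settling threshold.
below_threshold = change < SETTLING_THRESHOLD_PCT
```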
Original abstract
The accelerating pace of climate change introduces profound non-stationarities that challenge the ability of Machine Learning based climate emulators to generalize beyond their training distributions. While these emulators offer computationally efficient alternatives to traditional Earth System Models, their reliability remains a potential bottleneck under "no-analog" future climate states, which we define here as regimes where external forcing drives the system into conditions outside the empirical range of the historical training data. A fundamental challenge in evaluating this reliability is data contamination; because many models are trained on simulations that already encompass future scenarios, true out-of-distribution (OOD) performance is often masked. To address this, we benchmark the OOD robustness of three state-of-the-art architectures: U-Net, ConvLSTM, and the ClimaX foundation model specifically restricted to a historical-only training regime (1850-2014). We evaluate these models using two complementary strategies: (i) temporal extrapolation to the recent climate (2015-2023) and (ii) cross-scenario forcing shifts across divergent emission pathways. Our analysis within this experimental setup reveals an accuracy vs. stability trade-off: while the ClimaX foundation model achieves the lowest absolute error, it exhibits higher relative performance changes under distribution shifts, with precipitation errors increasing by up to 8.44% under extreme forcing scenarios. These findings suggest that when restricted to historical training dynamics, even high-capacity foundation models are sensitive to external forcing trajectories. Our results underscore the necessity of scenario-aware training and rigorous OOD evaluation protocols to ensure the robustness of climate emulators under a changing climate.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper benchmarks the out-of-distribution robustness of three climate emulator architectures (U-Net, ConvLSTM, and ClimaX) trained exclusively on historical data from 1850-2014. It evaluates performance under temporal extrapolation to 2015-2023 and cross-scenario forcing shifts across emission pathways, reporting that ClimaX achieves the lowest absolute errors but exhibits higher relative degradation, including up to an 8.44% increase in precipitation error under extreme scenarios. The central claim is that even high-capacity foundation models remain sensitive to no-analog external forcing trajectories when restricted to historical training dynamics, underscoring the need for scenario-aware training and rigorous OOD protocols.
Significance. If the historical-only isolation holds, the work supplies concrete empirical evidence of an accuracy-stability trade-off in ML climate emulators under distribution shifts. This is a useful contribution to the growing literature on foundation models for climate emulation, as it quantifies relative performance changes rather than absolute errors alone and highlights a concrete limitation that could guide future training strategies.
major comments (2)
- [Abstract] Abstract and Methods: The central no-analog claim rests on training being strictly restricted to 1850-2014 data with no residual contamination from future scenarios. However, the manuscript provides no description of the data pipeline, variable masking, normalization statistics, or whether ClimaX foundation weights were frozen or fine-tuned from a multi-scenario corpus. This isolation is load-bearing for interpreting the reported 8.44% precipitation error increase as true OOD sensitivity rather than partial in-distribution exposure.
- [Results] Results section: The 8.44% precipitation error increase under extreme forcing is presented without statistical details such as confidence intervals, number of ensemble members, or hypothesis tests. Without these, it is not possible to determine whether the relative change is robust or sensitive to particular preprocessing choices.
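One concrete leakage path raised in the first major comment is normalization: if the mean and standard deviation are computed over data that include future scenarios, the training inputs already carry information about the test regime. The sketch below shows the leakage-free variant, assuming a simple per-variable z-score. The function names and the toy series are hypothetical and do not describe the authors' code.

```python
from statistics import fmean, pstdev


def fit_normalizer(train_values):
    """Compute mean and standard deviation from training-period values only."""
    mean = fmean(train_values)
    std = pstdev(train_values)
    if std == 0:
        raise ValueError("training values have zero variance")
    return mean, std


def normalize(values, mean, std):
    """Apply statistics fitted on the training period to any other period."""
    return [(v - mean) / std for v in values]


# Toy series in arbitrary units (hypothetical, not paper data).
historical = [1.0, 2.0, 3.0, 4.0, 5.0]
future = [6.0, 8.0]

mean, std = fit_normalizer(historical)       # fitted on the historical period only
historical_z = normalize(historical, mean, std)
future_z = normalize(future, mean, std)      # future values reuse historical statistics
```

Under this scheme, future values can land far outside the normalized training range, which is exactly the shift the benchmark is meant to expose.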
minor comments (2)
- [Introduction] Clarify the exact definition of 'no-analog' regimes in the introduction, including how external forcing trajectories are quantified as outside the empirical range of 1850-2014 data.
- [Results] Add a table or figure summarizing absolute and relative errors across all models and scenarios for direct comparison.
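The first minor comment asks how forcing trajectories are quantified as lying outside the empirical range of the training data. One simple option, offered here only as an assumption about what such a definition could look like, is a per-variable range check against the training minimum and maximum. The function name and the forcing values below are hypothetical.

```python
def outside_historical_range(train_forcing, test_forcing):
    """Flag test forcing values outside the [min, max] range seen in training."""
    lo, hi = min(train_forcing), max(train_forcing)
    flags = [not (lo <= x <= hi) for x in test_forcing]
    fraction = sum(flags) / len(flags)
    return flags, fraction


# Toy forcing values in arbitrary units (hypothetical, not paper data).
train_forcing = [285.0, 310.0, 340.0, 398.0]
scenario_forcing = [400.0, 450.0, 390.0, 600.0]

flags, no_analog_fraction = outside_historical_range(train_forcing, scenario_forcing)
```

A range check on a single variable is a weak criterion: a combination of forcing agents can be unprecedented even when each agent individually stays within its historical range.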
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which has helped us strengthen the manuscript. We have revised the paper to provide additional details on the data pipeline and statistical analyses as requested. Our point-by-point responses follow.
- Referee: [Abstract] Abstract and Methods: The central no-analog claim rests on training being strictly restricted to 1850-2014 data with no residual contamination from future scenarios. However, the manuscript provides no description of the data pipeline, variable masking, normalization statistics, or whether ClimaX foundation weights were frozen or fine-tuned from a multi-scenario corpus. This isolation is load-bearing for interpreting the reported 8.44% precipitation error increase as true OOD sensitivity rather than partial in-distribution exposure.
  Authors: We agree that explicit documentation of the isolation protocol is essential. In the revised manuscript, the Methods section now includes a full description of the data pipeline: variables were masked to retain only historical-period statistics, normalization (mean and standard deviation) was computed exclusively over 1850-2014, and ClimaX was fine-tuned from its public pre-trained weights using solely the historical corpus with no future-scenario data leakage. All preprocessing steps are now enumerated to confirm strict historical-only training.
  Revision: yes
- Referee: [Results] Results section: The 8.44% precipitation error increase under extreme forcing is presented without statistical details such as confidence intervals, number of ensemble members, or hypothesis tests. Without these, it is not possible to determine whether the relative change is robust or sensitive to particular preprocessing choices.
  Authors: We accept this criticism and have augmented the Results section accordingly. The revised text reports 95% bootstrap confidence intervals around the relative error changes, specifies that all metrics are averaged over five independent ensemble members initialized with different random seeds, and includes paired t-test p-values confirming that the 8.44% precipitation degradation is statistically significant (p < 0.01) and insensitive to the tested normalization variants.
  Revision: yes
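A 95% bootstrap confidence interval of the kind the rebuttal mentions can be sketched as a percentile bootstrap over per-seed relative changes. This is an assumed procedure, not the authors' implementation: the five per-seed values are hypothetical placeholders, the resample count and seed are arbitrary, and the paired t-test is not shown.

```python
import random
from statistics import fmean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(fmean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical relative error changes (percent), one per random seed.
per_seed_change = [7.9, 8.6, 8.2, 8.8, 8.4]

mean_change = fmean(per_seed_change)
ci_low, ci_high = bootstrap_ci(per_seed_change)
```

With only five ensemble members the interval is coarse, since every resample mean is built from the same five numbers; that limitation applies to any bootstrap over so few seeds.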
Circularity Check
No circularity: empirical benchmarking with held-out measurements
The paper conducts direct empirical experiments: models are trained exclusively on 1850-2014 data and evaluated on later temporal windows and divergent scenarios. No equations, parameter fits, or derivations are presented whose outputs reduce to the inputs by construction. Claims rest on measured error deltas (e.g., 8.44% precipitation increase) from held-out test regimes rather than self-referential definitions or self-citation chains. The central result is therefore a set of observations, not a closed logical loop.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: "we strictly isolate a historical-only training regime (1850–2014)... temporal extrapolation to the recent climate (2015–2023) and cross-scenario forcing shifts across divergent emission pathways (SSP1-2.6 and SSP5-8.5)... precipitation errors increasing by up to 8.44%"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: "ClimaX foundation model... fine-tuned to map the four input forcing agents to the atmospheric response variables"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- No Epoch Like the Present: Robust Climate Emulation Requires Out-of-Distribution Generalisation
ML climate emulators degrade under seasonal distribution shifts that proxy long-term climate change, but physically motivated compositional decompositions improve out-of-distribution performance with modest in-distrib...