ThousandWorlds: A benchmark for climate emulation of potentially habitable exoplanets

Denis E. Sergeev; Edward T. Stevenson; Eric Wolf; Mei Ting Mak; Miles Cranmer; N. J. Mayne; Tobi Hammond

arXiv: 2606.18338 · v1 · pith:54F72TFHnew · submitted 2026-06-16 · 💻 cs.LG · astro-ph.EP · astro-ph.IM

Pith reviewed 2026-06-27 01:32 UTC · model grok-4.3

classification 💻 cs.LG · astro-ph.EP · astro-ph.IM

keywords exoplanet climate emulation · global climate model benchmark · parameter-to-field regression · multi-simulator dataset · habitable exoplanets · atmospheric fields · Gaussian process emulation

The pith

ThousandWorlds supplies roughly 1800 simulations from five climate models to train emulators that map eight planet parameters to 3D atmospheric fields.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ThousandWorlds as a curated dataset and benchmark for machine-learning emulation of exoplanet climates. It draws from five global climate models to generate approximately 1800 runs that link eight input parameters to full three-dimensional fields of temperature, humidity, winds, clouds, and radiation. Three nested task levels progress from single-model regression to multi-model regression with missing data. Evaluation protocols include direct ranking of methods and comparison against the spread among the climate models themselves. Baseline tests indicate that Gaussian-process approaches currently outperform standard deep-learning methods on this collection.

Core claim

We introduce ThousandWorlds, an ML-ready benchmark for exoclimate emulation and for the broader regime of low-data, multi-simulator, parameter-to-field regression. The dataset contains approximately 1800 simulations from five GCMs, mapping eight planet parameters to 3D atmospheric fields including temperature, humidity, winds, clouds, and radiation. Three nested subsets define progressively harder challenges: single-simulator regression, multi-simulator regression with complete observations, and multi-simulator regression with structured missingness. We propose two evaluation protocols: one for ranking methods, and one that measures performance relative to the disagreement between GCMs themselves.

What carries the argument

The ThousandWorlds dataset, which assembles runs from five GCMs into three nested regression tasks that map eight planet parameters onto 3D atmospheric fields.

If this is right

Emulators can be ranked by how closely they reproduce fields within one model and how well they stay inside the spread across models.
Gaussian-process methods appear better suited than current deep networks for this parameter-to-field mapping under limited data.
The benchmark supplies a concrete testbed for handling structured missingness across multiple simulators.
Progress on the hardest task level directly reduces the cost of exploring habitable-zone climates before running new full simulations.
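To make the parameter-to-field setting concrete, here is a minimal sketch of zero-mean Gaussian-process regression for a single scalar output, such as one coefficient of a compressed atmospheric field. It is not the paper's GP baseline: the RBF kernel, the fixed length scale, the noise level, and the two-parameter toy inputs are all assumptions chosen for illustration.

```python
from math import exp

def rbf(x, y, length_scale=1.0):
    """Squared-exponential kernel between two parameter vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-0.5 * d2 / length_scale ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_fit(X, y, noise=1e-6, length_scale=1.0):
    """Return alpha = (K + noise * I)^-1 y for the training set."""
    K = [[rbf(a, b, length_scale) + (noise if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    return solve(K, y)

def gp_predict(X, alpha, x_new, length_scale=1.0):
    """Posterior mean of the zero-mean GP at a new parameter vector."""
    return sum(a * rbf(xi, x_new, length_scale) for xi, a in zip(X, alpha))

# Toy data: two hypothetical planet parameters, one output coefficient.
X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_train = [1.0, 2.0, 3.0, 4.0]
alpha = gp_fit(X_train, y_train)
y_mid = gp_predict(X_train, alpha, [0.5, 0.5])
```

With only a handful of training runs, the exact linear solve is cheap, which is one plausible reason GP-style methods suit low-data regimes; a full emulator would repeat or couple this over many output coefficients.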

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dataset structure could be reused to test whether emulators trained on synthetic climates improve interpretation of real atmospheric spectra once those spectra become available.
Adding runs from additional climate models would tighten the inter-model disagreement baseline and expose whether current GP superiority holds under greater simulator diversity.
The missingness task level offers a natural probe for methods that must infer fields when only partial vertical or horizontal data are supplied, a situation likely to arise with sparse observations.

Load-bearing premise

The five chosen GCMs and the sampled planet-parameter space produce a representative enough ensemble that performance on the benchmark will translate to useful emulation on real exoplanet observations.

What would settle it

An emulator achieving high scores on ThousandWorlds yet producing temperature or cloud fields that systematically diverge from independent GCM runs or from actual telescope spectra of known exoplanets would show the benchmark does not capture the needed generalization.

Figures

Figures reproduced from arXiv: 2606.18338 by Denis E. Sergeev, Edward T. Stevenson, Eric Wolf, Mei Ting Mak, Miles Cranmer, N. J. Mayne, Tobi Hammond.

**Figure 2.** Per-planet relative RMSE distribution for GPLFR on Multi-partial, grouped by variable.

**Figure 3.** Spatial maps of temperature with superimposed wind vectors at relative isobar …

**Figure 4.** Absorbed shortwave radiation (ASR) maps for four test planets.

**Figure 5.** Dayside and nightside vertical profiles of area-weighted mean temperature and specific …

**Figure 6.** Predicted (y-axis) versus true (x-axis) values for six climate diagnostics across the Multi …
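The captions for Figures 2 and 5 refer to relative RMSE and area-weighted means. A minimal sketch of both quantities follows, assuming a regular latitude-longitude grid where cell area scales with the cosine of latitude. The normalisation used for "relative" here (RMSE divided by the area-weighted root-mean-square of the true field) is an assumption, and the paper's definition may differ.

```python
from math import cos, radians, sqrt

def area_weights(lats_deg):
    """Normalised cos(latitude) weights for a regular lat-lon grid."""
    w = [cos(radians(lat)) for lat in lats_deg]
    total = sum(w)
    return [wi / total for wi in w]

def area_weighted_mean(field, lats_deg):
    """field[i][j] is the value at latitude i and longitude j."""
    w = area_weights(lats_deg)
    return sum(wi * sum(row) / len(row) for wi, row in zip(w, field))

def relative_rmse(pred, true, lats_deg):
    """Area-weighted RMSE divided by the area-weighted RMS of the true field
    (assumed normalisation, for illustration only)."""
    w = area_weights(lats_deg)
    mse = sum(wi * sum((p - t) ** 2 for p, t in zip(pr, tr)) / len(tr)
              for wi, pr, tr in zip(w, pred, true))
    norm = sum(wi * sum(t ** 2 for t in tr) / len(tr)
               for wi, tr in zip(w, true))
    return sqrt(mse / norm)

# Toy temperature field (K) on three latitudes and two longitudes.
lats = [-60.0, 0.0, 60.0]
true_field = [[200.0, 200.0], [300.0, 300.0], [200.0, 200.0]]
pred_field = [[210.0, 210.0], [300.0, 300.0], [190.0, 190.0]]
mean_t = area_weighted_mean(true_field, lats)
rel_err = relative_rmse(pred_field, true_field, lats)
```

The weighting keeps the many small polar cells of a regular grid from dominating a global average.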

Original abstract

The search for life beyond Earth will depend on detecting faint signatures in the atmospheres of potentially habitable exoplanets. Interpreting those signatures requires understanding the host planet's climate: the same molecule may signal life on one planet and abiotic chemistry on another. Global climate models (GCMs) provide this understanding, but individual runs can require up to millions of core-hours and substantial domain expert time. Machine-learning emulators could remove this bottleneck, but progress has been limited by the absence of a curated, multi-model exoclimate dataset. We introduce ThousandWorlds, an ML-ready benchmark for exoclimate emulation and for the broader regime of low-data, multi-simulator, parameter-to-field regression. The dataset contains approximately 1800 simulations from five GCMs, mapping eight planet parameters to 3D atmospheric fields including temperature, humidity, winds, clouds, and radiation. Three nested subsets define progressively harder challenges: single-simulator regression, multi-simulator regression with complete observations, and multi-simulator regression with structured missingness. We propose two evaluation protocols: one for ranking methods, and one that measures performance relative to the disagreement between GCMs themselves. We evaluate seven baselines spanning simple methods, deep learning, and Gaussian processes. GP-based methods perform best, suggesting that ThousandWorlds exposes a regime where off-the-shelf deep learning does not yet succeed. Data: https://doi.org/10.57967/hf/8695. Code: https://github.com/edstevenson/ThousandWorlds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ThousandWorlds is the first public multi-GCM exoclimate dataset with defined ML benchmark tasks, which is a concrete and useful release even if its real-world transfer remains to be shown.

The main thing to know is that this paper releases ThousandWorlds, a dataset of roughly 1800 simulations from five GCMs that map eight planet parameters to 3D atmospheric fields, packaged with three nested tasks and two evaluation protocols. That directly addresses the gap they flag in the abstract.

The work does what a data-release paper should: it supplies the DOI and GitHub link, spells out the single-simulator, multi-simulator, and missing-data tiers, and reports baseline numbers showing Gaussian processes beating the deep-learning entries. The protocol that scores emulators against GCM disagreement is a reasonable way to ground the numbers without circularity. Those elements are new and give people a shared testbed they did not have before.
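One way to read the disagreement-relative protocol is as a ratio: the emulator's error against a target GCM, divided by the typical difference between GCMs on the same planet. The sketch below assumes that reading. The function names, the flattened-field representation, and the use of mean pairwise RMSE as the disagreement measure are illustrative choices, not the paper's definitions.

```python
from itertools import combinations
from math import sqrt

def rmse(a, b):
    """Root-mean-square error between two equal-length flattened fields."""
    if len(a) != len(b):
        raise ValueError("fields must have the same length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def inter_model_spread(gcm_fields):
    """Mean pairwise RMSE across GCM outputs for one planet (assumed measure)."""
    pairs = list(combinations(gcm_fields, 2))
    if not pairs:
        raise ValueError("need at least two GCM fields")
    return sum(rmse(a, b) for a, b in pairs) / len(pairs)

def disagreement_relative_score(prediction, target, gcm_fields):
    """Emulator RMSE divided by inter-GCM spread. A value below 1 means the
    emulator error is smaller than the typical GCM-to-GCM difference."""
    return rmse(prediction, target) / inter_model_spread(gcm_fields)

# Toy temperatures (K) at four grid points from three hypothetical GCMs.
gcms = [
    [250.0, 260.0, 270.0, 280.0],
    [252.0, 262.0, 272.0, 282.0],
    [248.0, 258.0, 268.0, 278.0],
]
pred = [251.0, 261.0, 271.0, 281.0]
score = disagreement_relative_score(pred, gcms[0], gcms)
```

In this toy case the emulator is off by 1 K while the models differ from each other by 8/3 K on average, so the score is 0.375.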

The softer spots are proportionate. The claim that success here will help with actual telescope spectra rests on the five GCMs and sampled parameter space being representative enough; the paper cannot prove that, and downstream users will have to check it. Without the full text it is also impossible to verify simulation quality controls or the precise missingness mechanism, though the abstract states they are documented. These are standard limitations for a benchmark release rather than load-bearing flaws.

The paper is for researchers who build or test emulators for exoclimate or other multi-simulator, low-data regression problems. Anyone already working in that intersection will get immediate use from the resource and the task definitions.

I would send it to peer review. The dataset and protocols are new, the baselines are reported, and the contribution is self-contained enough to be worth referee time.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces ThousandWorlds, an ML-ready benchmark dataset for exoclimate emulation consisting of approximately 1800 simulations from five GCMs that map eight planet parameters to 3D atmospheric fields (temperature, humidity, winds, clouds, radiation). It defines three nested task subsets (single-simulator regression, multi-simulator with complete observations, multi-simulator with structured missingness), two evaluation protocols (one for method ranking and one measuring performance relative to inter-GCM disagreement), and reports baseline results from seven methods (simple, deep learning, and GP) where GP-based approaches perform best.

Significance. If the released resource matches the description, this addresses a documented gap by providing a public, multi-GCM, parameter-to-field dataset with DOI and code repository. The nested tasks, structured missingness, and evaluation against GCM disagreement supply a concrete, falsifiable benchmark for low-data multi-simulator regression that extends beyond single-model emulation. Credit is due for the data release and the explicit comparison to inter-model spread rather than to self-derived quantities.

minor comments (2)

[abstract / §3] The abstract states 'approximately 1800 simulations' without breaking down the count per GCM or per task subset; adding a table or explicit counts in §3 would improve reproducibility of the baseline splits.
[§4] The 'structured missingness' pattern in the third task is referenced but not illustrated with an example mask or pseudocode; a small figure or listing in the methods would clarify the protocol for downstream users.
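As an illustration of the kind of listing requested in the second comment, the sketch below shows one possible form of structured missingness: whole variables absent for some simulators, with the error computed only over observed entries. The simulator names, variable names, and availability pattern are hypothetical; the abstract does not specify the actual mechanism.

```python
from math import sqrt

# Hypothetical variable names; each entry stands for one normalised field summary.
VARIABLES = ["temperature", "humidity", "wind_u", "clouds"]

# Hypothetical availability table: True where a simulator provides the variable.
AVAILABLE = {
    "gcm_a": {"temperature": True, "humidity": True, "wind_u": True, "clouds": True},
    "gcm_b": {"temperature": True, "humidity": True, "wind_u": False, "clouds": False},
}

def mask_for(simulator):
    """Boolean mask over VARIABLES for one simulator."""
    return [AVAILABLE[simulator][v] for v in VARIABLES]

def masked_rmse(pred, target, mask):
    """RMSE over observed entries only; masked-out targets are never read."""
    sq = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    if not sq:
        raise ValueError("no observed entries")
    return sqrt(sum(sq) / len(sq))

# gcm_b lacks wind_u and clouds, so those targets are None and are skipped.
target_b = [0.0, 1.0, None, None]
pred_b = [3.0, 5.0, 0.7, 0.2]
err_b = masked_rmse(pred_b, target_b, mask_for("gcm_b"))
```

The emulator still predicts every variable; the mask only controls which predictions contribute to the loss or the score.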

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful reading and positive assessment. The report accurately summarizes the contribution and recommends acceptance with no major comments.

Circularity Check

0 steps flagged

No significant circularity detected

This is a data-release and benchmarking paper whose central claim is the public release of ~1800 multi-GCM simulations together with three nested task definitions and two evaluation protocols. No derivations, equations, fitted parameters, or self-citations are invoked as load-bearing steps in any claimed prediction or uniqueness result. The evaluation protocols compare emulators directly to inter-GCM disagreement rather than to quantities derived from the same fitted objects, rendering the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is a dataset and benchmark release rather than a theoretical derivation; no free parameters are fitted, no background axioms are invoked beyond standard climate-model usage, and no new physical entities are postulated.
