Simulation based inference of the ionization history from the 2D 21 cm power spectrum

Carina Norregaard; Jonathan R. Pritchard; Nadia Cooper; Romain Meriot

arxiv: 2508.16329 · v2 · submitted 2025-08-22 · 🌌 astro-ph.CO

Pith reviewed 2026-05-18 21:51 UTC · model grok-4.3

keywords 21 cm signal · reionization · simulation-based inference · emulator · ionization history · SKA · power spectrum · Cosmic Dawn

The pith

Simulation-based inference recovers the ionization history from 21 cm 2D power spectra, but emulators of the statistic may degrade performance due to its stochastic nature.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains an emulator on 21cmFAST simulations to generate 21 cm 2D power spectra that include expected SKA noise and sample variance. It then applies simulation-based inference with neural posterior estimation to recover both astrophysical parameters and the neutral hydrogen fraction over redshift 5 to 12. Direct comparison shows that training on either raw simulations or emulator outputs can produce accurate posteriors for the ionization history. Coverage tests nevertheless reveal that adding emulator-generated samples brings no improvement and may worsen calibration. The authors conclude that the inherent randomness in the 2D power spectrum makes emulation of this particular summary statistic less reliable for inference.

Core claim

Neural posterior estimation recovers the ionization history and astrophysical parameters to good accuracy whether it is trained on noisy 21 cm 2D power spectra taken directly from 21cmFAST or on spectra produced by a trained emulator. Coverage diagnostics nevertheless show that supplementing the training set with emulated samples does not improve, and can degrade, the quality of the posterior estimates. This indicates that the stochastic character of the 2DPS summary statistic limits the utility of emulation in this setting.

What carries the argument

Neural posterior estimation (NPE) within simulation-based inference (SBI) applied to the 21 cm 2D power spectrum (2DPS) as the summary statistic, using an emulator trained on 21cmFAST runs that incorporate SKA instrumental noise and cosmic sample variance.
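
To make "incorporating SKA noise and sample variance" concrete, here is a minimal sketch of one common approximation: each (k_perp, k_par) bin receives Gaussian scatter of (P_signal + P_noise) / sqrt(N_modes). This is an assumption chosen for illustration, not the noise model used in the paper, and the function name, argument names, and toy numbers are all hypothetical.

```python
import math
import random

def noisy_2dps(signal_ps, noise_ps, n_modes, rng):
    """Draw one noisy realization of a binned 2D power spectrum.

    Assumption (not taken from the paper): Gaussian errors with
    sigma = (P_signal + P_noise) / sqrt(N_modes) in each bin.
    All inputs are nested lists indexed [i_kperp][i_kpar].
    """
    realization = []
    for p_row, pn_row, m_row in zip(signal_ps, noise_ps, n_modes):
        row = []
        for p, pn, m in zip(p_row, pn_row, m_row):
            sigma = (p + pn) / math.sqrt(m)  # sample variance + thermal noise
            row.append(rng.gauss(p, sigma))  # scatter around the true signal
        realization.append(row)
    return realization

# Toy inputs in arbitrary units, purely illustrative.
rng = random.Random(0)
signal = [[10.0, 8.0], [6.0, 4.0]]
noise = [[1.0, 2.0], [3.0, 5.0]]
modes = [[50, 200], [400, 1000]]
realization = noisy_2dps(signal, noise, modes, rng)
```

Bins with few modes scatter the most under this approximation, which is one way to picture why a finely binned 2DPS is a noisy target for an emulator.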

If this is right

The ionization history between redshift 5 and 12 can be recovered from SKA 2D power spectra without requiring full forward modeling at every step.
Astrophysical parameters governing the first stars and reionization can be jointly constrained alongside the neutral fraction.
Emulation of the 2DPS should be used with caution in SBI pipelines: it does not improve coverage and may degrade calibration, which the authors attribute to the stochastic nature of the statistic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If a less stochastic summary statistic such as the spherically averaged power spectrum were substituted, emulator-based training might regain its computational advantage.
The same SBI framework could be tested on other 21 cm observables to identify which statistics tolerate emulation without loss of calibration.
Extending the redshift range or including additional instrumental systematics would provide a direct test of how sensitive the coverage degradation is to the degree of stochasticity.
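
The first extension above mentions the spherically averaged power spectrum. As a rough sketch of how a 2DPS collapses to that statistic, the code below bins each (k_perp, k_par) cell by |k| = sqrt(k_perp^2 + k_par^2) and averages with mode-count weights. The function and variable names are hypothetical and the numbers are toy values, not results from the paper.

```python
import math

def spherical_average(k_perp, k_par, ps2d, n_modes, k_edges):
    """Collapse P(k_perp, k_par) into P(|k|) using mode-count weights.

    ps2d and n_modes are nested lists indexed [i_kperp][i_kpar];
    k_edges holds the edges of the 1D |k| bins.
    Bins that receive no modes are returned as None.
    """
    n_bins = len(k_edges) - 1
    weighted_sum = [0.0] * n_bins
    weight = [0.0] * n_bins
    for i, kp in enumerate(k_perp):
        for j, kz in enumerate(k_par):
            k = math.hypot(kp, kz)
            for b in range(n_bins):
                if k_edges[b] <= k < k_edges[b + 1]:
                    weighted_sum[b] += n_modes[i][j] * ps2d[i][j]
                    weight[b] += n_modes[i][j]
                    break
    return [s / w if w > 0 else None for s, w in zip(weighted_sum, weight)]

# Toy 2x2 grid, purely illustrative.
k_perp = [0.1, 0.2]
k_par = [0.1, 0.2]
ps2d = [[4.0, 2.0], [6.0, 1.0]]
n_modes = [[10, 10], [30, 5]]
ps1d = spherical_average(k_perp, k_par, ps2d, n_modes, [0.0, 0.25, 0.5])
```

Each 1D bin pools the modes of several 2D cells, so its scatter is smaller. The price is that the line-of-sight information that motivates the 2DPS is discarded.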

Load-bearing premise

The 21cmFAST simulations and the emulator trained on them must accurately reproduce the true 21 cm 2D power spectrum, including all relevant astrophysics, sample variance, and SKA noise.

What would settle it

A direct comparison between the ionization history inferred from actual SKA 21 cm observations and independent constraints from CMB optical depth or Lyman-alpha forest data that shows a statistically significant mismatch would falsify the claim that the SBI pipeline recovers the correct history.
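
For reference, the standard relation that links an ionization history to the CMB optical depth is written out below. The notation is an assumption made here rather than taken from the paper: $x_{\mathrm{HI}}(z)$ is the neutral hydrogen fraction, $\bar{n}_{\mathrm{H},0}$ the present-day mean hydrogen number density, $\sigma_T$ the Thomson cross-section, $H(z)$ the Hubble rate, and $f_e$ a correction for electrons from helium, here taken to be singly ionized along with hydrogen.

```latex
\tau_e = c\,\sigma_T\,\bar{n}_{\mathrm{H},0}\,f_e
         \int_0^{z_{\max}} \frac{\left[1 - x_{\mathrm{HI}}(z)\right](1+z)^2}{H(z)}\,\mathrm{d}z ,
\qquad
f_e \simeq 1 + \frac{Y_p}{4\,(1 - Y_p)} .
```

Because the inferred history covers only z ~ 5-12, such a comparison would also need an assumption for lower redshifts (for example, a fully ionized medium) and a choice of $z_{\max}$.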

Original abstract

The 21 cm signal contains a wealth of information about the formation of the first stars and the reionization of the intergalactic medium during the Cosmic Dawn (CD) and Epoch of Reionization (EoR). The timing of these important milestones has only roughly been constrained through indirect measurements, such as from the cosmic microwave background (CMB) optical depth, and Lyman-$\alpha$ forest. Therefore, inferring the neutral fraction over cosmic time is a goal of upcoming 21 cm experiments, such as the Square Kilometer Array (SKA). We contrast two approaches to infer astrophysical parameters and ionization history from 21 cm 2D power spectra (2DPS). We develop an emulator of the 21 cm 2DPS, trained on 21cmFAST simulations, taking into account the expected instrumental noise from the SKA and sample variance. We then perform simulation based inference (SBI) using neural posterior estimation (NPE). We compare training on datasets of noisy 2DPS obtained from 21cmFAST simulations and an emulator, to infer astrophysical parameters of interest. Using an emulator of the ionization history, which has been trained on simulations from the same astrophysical parameters, we then obtain posterior distributions of the ionization history over the redshift range z $\sim$ 5-12. We demonstrate that both methods are capable of accurately recovering the ionization history and astrophysical parameters. However, coverage tests indicate that adding emulated samples does not improve predictions. This work suggests that due to the stochastic nature of the 2DPS, using an emulator of this summary statistic may result in poorer inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript develops an emulator for the 21 cm 2D power spectrum (2DPS) trained on 21cmFAST simulations that incorporates expected SKA instrumental noise and sample variance. It then performs simulation-based inference (SBI) with neural posterior estimation (NPE) to recover astrophysical parameters and the ionization history over z ~ 5-12. The central claim is that both direct-simulation and emulator-based training recover the ionization history and parameters accurately, yet coverage tests show that augmenting with emulated samples yields no improvement and may produce poorer inference owing to the stochastic character of the 2DPS.

Significance. If the quantitative results hold, the work is significant for 21 cm cosmology because it supplies a concrete demonstration of SBI for inferring the reionization history—a primary science target for SKA—and supplies a cautionary empirical result on the use of emulators for inherently stochastic summary statistics. The direct comparison of two training regimes is a methodological strength that could inform pipeline design for upcoming intensity-mapping experiments.

major comments (3)

[Methods (emulator training and SBI pipeline)] The central claim that emulator-augmented training does not improve (and may degrade) inference rests on the premise that both 21cmFAST and the trained emulator faithfully reproduce the full distribution of the noisy 2DPS, including cosmic variance and SKA noise. The manuscript provides no explicit validation (e.g., power-spectrum residuals, variance matching, or cross-code comparison) of this distributional fidelity for the stochastic components; this assumption is load-bearing for interpreting the coverage-test outcome.
[Results (coverage tests)] Coverage tests are invoked to support the negative result on emulated samples, yet the text does not report the numerical coverage probabilities, the number of test realizations, error bars on the coverage statistic, or the precise definition of the coverage metric employed. Without these quantities it is difficult to judge the statistical significance of the claimed lack of improvement.
[Results (posterior recovery)] The claim of “accurate recovery” of astrophysical parameters and ionization history is stated without a quantitative side-by-side comparison (posterior widths, calibration scores, or posterior-predictive checks) between the direct-simulation and emulator-trained NPE models. Such metrics are needed to substantiate the assertion that emulator use results in poorer inference.

minor comments (2)

[Methods] Specify the neural-network architecture, training loss, and hyper-parameter choices for both the 2DPS emulator and the NPE network; this information is required for reproducibility.
[Figures] In figures displaying recovered posteriors, overlay the true parameter values and indicate the 68 % and 95 % credible intervals for direct visual assessment of calibration.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for the constructive comments, which have helped us identify areas where additional detail will strengthen the presentation. We respond to each major comment below and indicate the revisions we will make.

Point-by-point responses

Referee: [Methods (emulator training and SBI pipeline)] The central claim that emulator-augmented training does not improve (and may degrade) inference rests on the premise that both 21cmFAST and the trained emulator faithfully reproduce the full distribution of the noisy 2DPS, including cosmic variance and SKA noise. The manuscript provides no explicit validation (e.g., power-spectrum residuals, variance matching, or cross-code comparison) of this distributional fidelity for the stochastic components; this assumption is load-bearing for interpreting the coverage-test outcome.

Authors: We agree that more explicit validation of the emulator's fidelity to the stochastic components of the noisy 2DPS would strengthen the manuscript. The emulator was trained to reproduce the mean and variance of the 2DPS from 21cmFAST (including sample variance and SKA noise), but we did not present residual or variance-matching diagnostics in the submitted version. In the revision we will add these diagnostics, including mean residuals and variance comparisons across redshift bins, to better support the premise underlying the coverage-test results. We note that a full cross-code comparison lies outside the scope of the present work, which focuses on the 21cmFAST-based pipeline. revision: yes
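
The diagnostics promised in this response (mean residuals and variance comparisons) could be computed along the following lines. This is a hedged sketch rather than the authors' procedure: it assumes many simulated and emulated realizations are available at one fixed parameter point, and the function name and toy inputs are hypothetical.

```python
def fidelity_diagnostics(sim_draws, emu_draws):
    """Per-bin fractional mean residual and variance ratio (emulator / simulation).

    sim_draws, emu_draws: lists of realizations, each a flat list of bins,
    all generated at the same parameter point.
    """
    def mean_and_variance(draws):
        n = len(draws)
        n_bins = len(draws[0])
        means = [sum(d[b] for d in draws) / n for b in range(n_bins)]
        variances = [
            sum((d[b] - means[b]) ** 2 for d in draws) / (n - 1)
            for b in range(n_bins)
        ]
        return means, variances

    sim_mean, sim_var = mean_and_variance(sim_draws)
    emu_mean, emu_var = mean_and_variance(emu_draws)
    residual = [(e - s) / s for e, s in zip(emu_mean, sim_mean)]
    var_ratio = [ev / sv for ev, sv in zip(emu_var, sim_var)]
    return residual, var_ratio

# Toy case: the "emulator" matches the mean but has too little scatter.
sim = [[1.0, 10.0], [3.0, 14.0], [2.0, 12.0]]
emu = [[1.5, 11.0], [2.5, 13.0], [2.0, 12.0]]
residual, var_ratio = fidelity_diagnostics(sim, emu)
```

In the toy case the mean residual is zero in both bins while the variance ratio is well below one. That is the kind of mismatch a mean-only validation would miss.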
Referee: [Results (coverage tests)] Coverage tests are invoked to support the negative result on emulated samples, yet the text does not report the numerical coverage probabilities, the number of test realizations, error bars on the coverage statistic, or the precise definition of the coverage metric employed. Without these quantities it is difficult to judge the statistical significance of the claimed lack of improvement.

Authors: We acknowledge that the coverage-test section would benefit from greater quantitative detail. In the revised manuscript we will report the numerical coverage probabilities at the 68 % and 95 % levels for both training regimes, state the number of independent test realizations used, include error bars on the coverage statistic, and provide an explicit definition of the coverage metric (fraction of test points falling within the stated credible intervals). These additions will allow readers to assess the statistical significance of the reported lack of improvement when emulated samples are included. revision: yes
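
The coverage metric described in this response (the fraction of test points that fall within the stated credible interval) can be sketched as follows, with a binomial error bar on the fraction. The toy model is an assumption made only so that the exact posterior is known: a Gaussian prior and Gaussian noise, both with unit variance. It is not the paper's setup, and all names are hypothetical.

```python
import math
import random

def central_interval(samples, level):
    """Central credible interval from 1D posterior samples (nearest-rank quantiles)."""
    ordered = sorted(samples)
    n = len(ordered)
    lo_idx = int(math.floor((0.5 - level / 2) * (n - 1)))
    hi_idx = int(math.ceil((0.5 + level / 2) * (n - 1)))
    return ordered[lo_idx], ordered[hi_idx]

def empirical_coverage(truths, posteriors, level):
    """Fraction of truths inside the interval, with a binomial standard error."""
    hits = 0
    for truth, samples in zip(truths, posteriors):
        lo, hi = central_interval(samples, level)
        if lo <= truth <= hi:
            hits += 1
    n = len(truths)
    frac = hits / n
    err = math.sqrt(frac * (1 - frac) / n)
    return frac, err

# Toy calibrated case: theta ~ N(0, 1), x = theta + N(0, 1),
# so the exact posterior is N(x / 2, 1 / 2).
rng = random.Random(1)
truths, posteriors = [], []
for _ in range(1000):
    theta = rng.gauss(0.0, 1.0)
    x = theta + rng.gauss(0.0, 1.0)
    posteriors.append([rng.gauss(x / 2, math.sqrt(0.5)) for _ in range(500)])
    truths.append(theta)
frac68, err68 = empirical_coverage(truths, posteriors, 0.68)
```

For a well-calibrated posterior, `frac68` should sit near 0.68 within a few times `err68`. Values well above the nominal level indicate underconfident posteriors, and values well below it indicate overconfident ones.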
Referee: [Results (posterior recovery)] The claim of “accurate recovery” of astrophysical parameters and ionization history is stated without a quantitative side-by-side comparison (posterior widths, calibration scores, or posterior-predictive checks) between the direct-simulation and emulator-trained NPE models. Such metrics are needed to substantiate the assertion that emulator use results in poorer inference.

Authors: We agree that a quantitative side-by-side comparison would make the claim of accurate recovery and the relative performance of the two training approaches more rigorous. The original manuscript presents example posterior distributions and ionization-history recoveries but does not tabulate metrics such as posterior widths or calibration scores. In the revision we will add a table (or text) comparing posterior widths for the key astrophysical parameters, together with calibration diagnostics, between the direct-simulation and emulator-trained NPE models. This will provide the requested quantitative support for the conclusion that emulator augmentation yields no improvement. revision: yes

Circularity Check

0 steps flagged

SBI pipeline grounded in external 21cmFAST simulations with no load-bearing self-referential reduction

Full rationale

The derivation chain begins with external 21cmFAST simulations that generate the 2D power spectra including noise and variance, followed by training of emulators and application of neural posterior estimation (NPE) for inference. No equation or step reduces the recovered ionization history or astrophysical parameters to a fitted input by construction, nor does the central claim rest on a self-citation chain that is itself unverified. Coverage tests and direct comparison of simulation versus emulator training supply independent checks. The pipeline is therefore self-contained against the external simulation benchmark, consistent with a minor non-load-bearing self-citation score of 2.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the fidelity of 21cmFAST as a forward model and on the ability of the neural emulators to capture both the mean signal and its stochastic variance.

free parameters (1)

astrophysical parameters varied in 21cmFAST
Parameters controlling star formation, ionizing efficiency, and escape fraction are sampled to generate the training set; their posterior is the target of inference.

axioms (1)

domain assumption: 21cmFAST plus added SKA noise and sample variance accurately represents the observable 2D power spectrum
Invoked when the emulator is used to generate training data for SBI.

pith-pipeline@v0.9.0 · 5841 in / 1313 out tokens · 41674 ms · 2026-05-18T21:51:49.877838+00:00 · methodology
