Improved photometric redshift estimations through self-organising map-based data augmentation

Alex I. Malz; Eric Gawiser; Henk Hoekstra; Irene Moskowitz; Joe Zuntz; Konrad Kuijken; Marika Asgari; The LSST Dark Energy Science Collaboration; Tianqing Zhang; Yun-Hao Zhang

arxiv: 2508.20903 · v2 · submitted 2025-08-28 · 🌌 astro-ph.GA · astro-ph.CO


Yun-Hao Zhang, Joe Zuntz, Irene Moskowitz, Eric Gawiser, Konrad Kuijken, Marika Asgari, Henk Hoekstra, Alex I. Malz, Ziang Yan, Tianqing Zhang, The LSST Dark Energy Science Collaboration


Pith reviewed 2026-05-18 20:46 UTC · model grok-4.3


keywords: photometric redshifts · self-organising maps · data augmentation · LSST · mock catalogues · CosmoDC2 · galaxy surveys · redshift estimation


The pith

Augmenting under-sampled regions of self-organising maps with simulated galaxies reduces photometric redshift biases and cuts catastrophic failures by up to a factor of two for high-redshift objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that projects galaxy spectral energy distributions onto a two-dimensional self-organising map to locate regions poorly covered by existing spectroscopic samples. These sparse areas are then filled with galaxies drawn from the independent CosmoDC2 mock catalogues to produce a more complete training set for photometric redshift models. When tested on realistic mock data sets that mimic LSST Year 1 and Year 10 observations, the augmented training sets yield smaller systematic errors and up to roughly a factor of two fewer catastrophic outliers, especially at true redshifts above about 1.5. Accurate photometric redshifts are a prerequisite for most cosmological and galaxy-evolution studies with next-generation imaging surveys.

Core claim

Projecting galaxy SEDs onto a SOM identifies under-sampled regions that are then augmented with simulated galaxies from the independent CosmoDC2 catalogues; when this augmented training set is used on 501 degraded realisations of LSST-like photometry, photometric redshift performance improves markedly relative to unaugmented models, with reduced systematic biases, up to a factor-of-two drop in catastrophic failures, and less information loss in the conditional density estimates, particularly for galaxies at z_true ≳ 1.5.

What carries the argument

Self-organising maps that project galaxy spectral energy distributions into a two-dimensional grid, thereby flagging regions sparsely populated by real spectroscopic observations and enabling targeted augmentation with simulated galaxies.
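The mechanism described above can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the 8×8 grid, the Gaussian neighbourhood, the linear learning-rate schedule, the `min_ratio` threshold that defines "under-sampled", and the toy Gaussian "colours" standing in for real photometry.

```python
import numpy as np

def train_som(data, grid=(8, 8), n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Train a rectangular SOM with a Gaussian neighbourhood (illustrative settings)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = grid
    # Grid coordinates of every cell, shape (n_cells, 2).
    coords = np.array([(i, j) for i in range(n_rows) for j in range(n_cols)], dtype=float)
    # Initialise cell weights from randomly chosen training vectors.
    weights = data[rng.integers(0, len(data), n_rows * n_cols)].astype(float)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)              # linearly decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 0.5  # shrinking neighbourhood width
        x = data[rng.integers(0, len(data))]
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))  # best-matching unit
        d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
        h = np.exp(-d2 / (2.0 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)
    return weights

def assign_cells(data, weights):
    """Return the index of the best-matching cell for each row of data."""
    d2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def augment(spec_cells, phot_cells, sim_cells, n_cells, min_ratio=0.5, seed=0):
    """Pick simulated galaxies for cells where the spectroscopic sample is sparse."""
    rng = np.random.default_rng(seed)
    spec_n = np.bincount(spec_cells, minlength=n_cells)
    phot_n = np.bincount(phot_cells, minlength=n_cells)
    # Spec count each cell would have if the spec sample traced the photometric one.
    expected = phot_n * len(spec_cells) / max(len(phot_cells), 1)
    picked = []
    for c in np.flatnonzero(spec_n < min_ratio * expected):
        deficit = int(np.ceil(expected[c] - spec_n[c]))
        pool = np.flatnonzero(sim_cells == c)
        if len(pool) > 0:
            picked.append(rng.choice(pool, size=min(deficit, len(pool)), replace=False))
    return np.concatenate(picked) if picked else np.array([], dtype=int)

# Toy data: four "colours", with the first one carrying most of the variance.
rng = np.random.default_rng(1)
scale = np.array([3.0, 1.0, 1.0, 1.0])
phot = rng.normal(size=(2000, 4)) * scale   # photometric sample
spec = phot[phot[:, 0] < 0.0][:500]         # spec sample missing half of colour space
sim = rng.normal(size=(2000, 4)) * scale    # stand-in for an independent simulation

weights = train_som(phot)
idx = augment(assign_cells(spec, weights), assign_cells(phot, weights),
              assign_cells(sim, weights), n_cells=len(weights))
augmented_training_set = np.vstack([spec, sim[idx]])
```

In this toy setup the spectroscopic sample only covers negative values of the first colour, so the selected simulated galaxies should sit mostly on the positive side. A real application would train on observed fluxes or colours and would need a principled choice of grid size and sparsity threshold.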

Load-bearing premise

The simulated galaxies drawn from CosmoDC2 must reproduce the colour and magnitude distributions of real galaxies that occupy the under-sampled SOM cells, and the 501 degraded mock realisations must faithfully represent the selection functions and spectroscopic success rates of future LSST follow-up surveys.

What would settle it

Apply the SOM-augmented training procedure to real spectroscopic samples from ongoing surveys and check whether the measured reductions in bias and outlier rate match the improvement of up to a factor of two seen in the mocks when validated against independent, high-fidelity redshifts.

Original abstract

We introduce a framework for the enhanced estimation of photometric redshifts using Self-Organising Maps (SOMs). Our method projects galaxy Spectral Energy Distributions (SEDs) onto a two-dimensional map, identifying regions that are sparsely sampled by existing spectroscopic observations. These under-sampled areas are then augmented with simulated galaxies, yielding a more representative spectroscopic training dataset. To assess the efficacy of this SOM-based data augmentation in the context of the forthcoming Legacy Survey of Space and Time (LSST), we employ mock galaxy catalogues from the OpenUniverse2024 project and generate synthetic datasets that mimic the expected photometric selections of LSST after one (Y1) and ten (Y10) years of observation. We construct 501 degraded realisations by sampling galaxy colours, magnitudes, redshifts and spectroscopic success rates, in order to emulate the compilation of a wide array of realistic spectroscopic surveys. Augmenting the degraded mock datasets with simulated galaxies from the independent CosmoDC2 catalogues has markedly improved the performance of our photometric redshift estimates compared to models lacking this augmentation, particularly for high-redshift galaxies ($z_\mathrm{true} \gtrsim 1.5$). This improvement is manifested in notably reduced systematic biases and a decrease in catastrophic failures by up to approximately a factor of 2, along with a reduction in information loss in the conditional density estimations. These results underscore the effectiveness of SOM-based augmentation in refining photometric redshift estimation, thereby enabling more robust analyses in cosmology and astrophysics for the NSF-DOE Vera C. Rubin Observatory.
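The abstract reports results in terms of systematic bias and catastrophic failures. As a rough sketch of how such summary statistics are often computed, the snippet below uses the normalised residual Δz = (z_phot − z_true) / (1 + z_true). The 0.15 outlier threshold and the use of the median and NMAD are common conventions assumed here for illustration; the paper's own metric definitions are not given in this page and may differ. The function name `photoz_summary` and the four toy redshifts are hypothetical.

```python
import numpy as np

def photoz_summary(z_true, z_phot, outlier_cut=0.15):
    """Bias, robust scatter and outlier fraction of dz = (z_phot - z_true) / (1 + z_true)."""
    z_true = np.asarray(z_true, dtype=float)
    z_phot = np.asarray(z_phot, dtype=float)
    dz = (z_phot - z_true) / (1.0 + z_true)
    bias = float(np.median(dz))
    # Normalised median absolute deviation, a common robust scatter estimate.
    nmad = float(1.4826 * np.median(np.abs(dz - np.median(dz))))
    # Assumed convention: a catastrophic outlier has |dz| above outlier_cut.
    outlier_fraction = float(np.mean(np.abs(dz) > outlier_cut))
    return {"bias": bias, "nmad": nmad, "outlier_fraction": outlier_fraction}

# Toy values only; one high-redshift galaxy is badly misestimated.
z_true = np.array([0.5, 1.0, 1.6, 2.0])
z_phot = np.array([0.5, 1.1, 0.3, 2.0])

summary = photoz_summary(z_true, z_phot)
high_z = photoz_summary(z_true[z_true >= 1.5], z_phot[z_true >= 1.5])
```

Splitting the statistics by true redshift, as in `high_z`, mirrors the way the abstract singles out galaxies at z_true ≳ 1.5.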

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SOM-guided augmentation from CosmoDC2 cuts photo-z catastrophic failures by up to 2x on LSST mocks, but the gains rest on untested simulation fidelity for real missing galaxies.


The core result is that projecting galaxy SEDs onto a SOM, spotting under-sampled cells, and adding targeted galaxies from the independent CosmoDC2 catalog improves photometric redshift estimates on degraded OpenUniverse2024 mocks. Catastrophic outlier rates drop by up to roughly a factor of two, most clearly at z_true ≳ 1.5, with accompanying reductions in bias and information loss. The authors demonstrate this across 501 realizations that sample colors, magnitudes, redshifts, and spectroscopic success rates to mimic LSST Y1 and Y10 selections.

Referee Report

2 major / 1 minor

Summary. The paper proposes a SOM-based framework to identify under-sampled regions in galaxy color-magnitude space from spectroscopic data and augments the training set with simulated galaxies drawn from the independent CosmoDC2 catalog. The method is evaluated on 501 degraded realizations of OpenUniverse2024 mocks that emulate LSST Y1 and Y10 photometric selections and spectroscopic success rates, with reported gains in bias reduction, fewer catastrophic outliers (up to factor ~2), and lower information loss in conditional density estimates, especially for z_true ≳ 1.5.

Significance. If the augmentation correctly targets the actual missing population rather than simulation-to-simulation differences, the approach could meaningfully improve high-redshift photo-z performance for LSST cosmology. The use of 501 independent degraded realizations and cross-catalog augmentation is a clear methodological strength that supports reproducibility of the reported gains.

major comments (2)

Mock construction and augmentation procedure: no direct validation is presented that the color-magnitude distributions (or SEDs) of the added CosmoDC2 galaxies match the actual galaxies that occupy the under-sampled SOM cells in real spectroscopic follow-up. The observed reduction in bias and outliers could therefore arise from differences between the two simulation suites rather than from correcting observational incompleteness.
Section describing the mock construction: while the 501 realizations sample colors, magnitudes, redshifts and success rates, the quantitative sensitivity of the reported improvements to specific degradation choices (e.g., exact spectroscopic success-rate model or photometric noise realization) is not quantified, leaving open whether the factor-of-two gain is robust to plausible variations in the LSST follow-up scenario.
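The first major comment asks for a check that the added simulated galaxies resemble the population they stand in for, cell by cell. One simple way such a check could look is sketched below: a two-sample Kolmogorov-Smirnov statistic on a single colour, computed separately in each SOM cell. This is not a procedure from the paper. The helper names, the `min_count` cut, the one-dimensional comparison, and the toy Gaussian values are all assumptions for illustration.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def per_cell_ks(values_ref, cells_ref, values_sim, cells_sim, cells, min_count=20):
    """KS statistic of one colour per SOM cell, skipping cells with too few objects."""
    out = {}
    for c in cells:
        ref = values_ref[cells_ref == c]
        sim = values_sim[cells_sim == c]
        if len(ref) >= min_count and len(sim) >= min_count:
            out[int(c)] = ks_statistic(ref, sim)
    return out

# Toy example with two cells: the simulation matches in cell 0 and is offset in cell 1.
rng = np.random.default_rng(0)
cells_ref = np.repeat([0, 1], 200)
cells_sim = np.repeat([0, 1], 200)
values_ref = rng.normal(0.0, 1.0, size=400)
values_sim = rng.normal(0.0, 1.0, size=400) + 2.0 * (cells_sim == 1)

result = per_cell_ks(values_ref, cells_ref, values_sim, cells_sim, cells=[0, 1])
```

A large statistic in a cell, as for cell 1 in this toy case, would flag a mismatch between the two catalogues there. A full comparison would need to consider several colours and magnitudes jointly rather than one colour at a time.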

minor comments (1)

The abstract and main text would benefit from an explicit statement of the SOM grid size, neighborhood function, and convergence criteria used for the projection step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which have helped us improve the clarity and robustness of our presentation. We address each major comment below, making revisions to the manuscript where appropriate to strengthen the discussion of our mock-based validation framework.


Referee: Mock construction and augmentation procedure: no direct validation is presented that the color-magnitude distributions (or SEDs) of the added CosmoDC2 galaxies match the actual galaxies that occupy the under-sampled SOM cells in real spectroscopic follow-up. The observed reduction in bias and outliers could therefore arise from differences between the two simulation suites rather than from correcting observational incompleteness.

Authors: We agree that a direct comparison to real spectroscopic follow-up data is not possible in this study, which is conducted entirely within a controlled mock framework. Our design deliberately employs two independent simulation suites (OpenUniverse2024 for the degraded spectroscopic training set and CosmoDC2 for augmentation) to emulate the real-world scenario in which simulations are used to fill gaps in observational samples. In the revised manuscript we have added a new comparison (Figure X and accompanying text in Section 3) of the color-magnitude and SED distributions of the CosmoDC2 galaxies assigned to under-sampled SOM cells against the full underlying distribution in the OpenUniverse2024 mocks. This shows that the augmentation populates the relevant regions of parameter space. Because the degradation procedure explicitly introduces the spectroscopic incompleteness and the largest gains occur at z_true ≳ 1.5 where incompleteness is most severe, we maintain that the reported reductions in bias and outliers arise primarily from correcting the sampling gaps rather than from catalog-to-catalog differences.
Revision: partial

Referee: Section describing the mock construction: while the 501 realizations sample colors, magnitudes, redshifts and success rates, the quantitative sensitivity of the reported improvements to specific degradation choices (e.g., exact spectroscopic success-rate model or photometric noise realization) is not quantified, leaving open whether the factor-of-two gain is robust to plausible variations in the LSST follow-up scenario.

Authors: The 501 realizations already incorporate substantial variation by sampling galaxy colors, magnitudes, redshifts, and spectroscopic success rates drawn from a range of plausible LSST-like survey configurations. Nevertheless, we acknowledge that we have not performed an exhaustive sensitivity analysis that varies the precise functional form of the success-rate model or additional independent noise realizations beyond the ensemble sampling. In the revised manuscript we have expanded the discussion in the mock-construction section to report that the factor-of-two reduction in catastrophic outliers and the bias improvements remain statistically consistent when the ensemble is subdivided by average success rate and by photometric depth. While a more comprehensive parameter-variation study would be valuable, the existing ensemble already demonstrates that the gains are not driven by a single narrow choice of degradation parameters.
Revision: partial

Circularity Check

0 steps flagged

No significant circularity; augmentation and evaluation use independent external catalogs and held-out mocks


The paper constructs degraded realizations from OpenUniverse2024 mocks, augments them with galaxies drawn from the separate CosmoDC2 catalog, and evaluates photometric redshift performance on held-out mock galaxies whose redshifts are known by construction. No step reduces by definition or self-citation to the input data; the reported bias and outlier reductions are measured against non-augmented baselines using external simulation data. This is a standard self-contained empirical test against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central improvement claim rests on the assumption that the mock catalogs and degradation procedure accurately represent real LSST observations and that CosmoDC2 galaxies are statistically representative of the missing real galaxies in color space. No new physical entities are introduced.

axioms (2)

Domain assumption: The 501 degraded realizations adequately sample the range of photometric selections, magnitude limits, and spectroscopic success rates expected for LSST Y1 and Y10. Invoked when constructing the test datasets that emulate future surveys.
Domain assumption: CosmoDC2 galaxies occupy the same color-magnitude-redshift space as real galaxies that would be observed in the under-sampled SOM cells. Required for the augmentation step to improve rather than degrade the training set.

pith-pipeline@v0.9.0 · 5851 in / 1517 out tokens · 45353 ms · 2026-05-18T20:46:28.590515+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · unclear
Paper passage: "Augmenting the degraded mock datasets with simulated galaxies from the independent CosmoDC2 catalogues has markedly improved the performance of our photometric redshift estimates... particularly for high-redshift galaxies (z_true ≳ 1.5)"

IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction · unclear
Paper passage: "We employ the Self-Organising Map (SOM) machine learning algorithm to map galaxy broadband SEDs onto a two-dimensional grid"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.