Observational constraints on the origin of the elements. X. Combining NLTE and machine learning for chemical diagnostics of 4 million stars in the 4MIDABLE-HR survey

Georges Kordopatis; Gražina Tautvaišienė; Gregor Traven; Guillaume Guiglion; Maria Bergemann; Mingjie Jian; Nicholas Storm; Ross P. Church; Thomas Bensby; Tomasz Różański

arxiv: 2512.15888 · v2 · submitted 2025-12-17 · 🌌 astro-ph.SR · astro-ph.GA

Nicholas Storm, Maria Bergemann, Tomasz Różański, Victor F. Ksoll, Thomas Bensby, Gregor Traven, Georges Kordopatis, Ross P. Church, Mingjie Jian, Weijia Sun, Guillaume Guiglion, Gražina Tautvaišienė

Pith reviewed 2026-05-16 21:16 UTC · model grok-4.3

keywords: stellar abundances · NLTE · machine learning · 4MOST · galactic chemical evolution · FGK stars · spectroscopic analysis · Milky Way

The pith

Neural network trained on NLTE spectra recovers 18 elemental abundances from 4MOST-quality data with biases under 0.13 dex

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an artificial neural network trained on 404793 synthetic FGK spectra with 16 elements computed in non-local thermodynamic equilibrium (NLTE). This network forms part of a pipeline that will automatically derive stellar parameters and abundances for the 4 million stars targeted by the 4MIDABLE-HR survey. Validation on 121 observed spectra of low-mass stars, degraded to the survey resolution, shows that all 18 abundances are recovered with bias below 0.13 dex and spread below 0.16 dex, with typical values below 0.09 dex for most elements. The derived abundances are then compared with predictions from the OMEGA+ galactic chemical evolution model. Together these results demonstrate the performance expected for high-resolution 4MOST spectra and the potential of using many elements to trace the Milky Way's formation history.

Core claim

The central claim is that the 4MOST-HR resolution NLTE Payne ANN, trained on 404793 new FGK spectra, enables a fully automatic fitting algorithm to self-consistently derive stellar parameters and 18 elemental abundances from spectra at R approximately 20000. When tested on 121 observed FGKM stars spanning main-sequence to giant phases and down to [Fe/H] approximately -3.3, the method recovers abundances with bias less than 0.13 dex and spread less than 0.16 dex, with typical values under 0.09 dex for most elements. These measurements recover the expected galactic trends when compared to the OMEGA+ model.
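The bias and spread quoted here are, following the Figure 4 caption, the average and standard deviation of the difference between the Payne and TSFitPy abundances. A minimal sketch of that bookkeeping is given below. The function name and the sample values are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the page does not say which estimator the paper uses.

```python
from statistics import mean, stdev

def bias_and_spread(payne, reference):
    """Return (bias, spread) in dex for payne - reference.

    bias   = mean of the differences
    spread = sample standard deviation of the differences (assumed estimator)
    """
    if len(payne) != len(reference):
        raise ValueError("both abundance lists must have the same length")
    diffs = [p - r for p, r in zip(payne, reference)]
    return mean(diffs), stdev(diffs)

# Hypothetical A(Mg) values for four stars, for illustration only.
payne_mg = [7.50, 7.10, 6.42, 5.95]
tsfitpy_mg = [7.45, 7.12, 6.40, 5.90]

bias, spread = bias_and_spread(payne_mg, tsfitpy_mg)
print(f"bias = {bias:+.3f} dex, spread = {spread:.3f} dex")
```

With real data the same function would be called once per element, which is how a per-element table of bias and spread could be assembled.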

What carries the argument

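The NLTE Payne artificial neural network, trained on grids of NLTE radiative-transfer spectra, together with the fitting algorithm built on it. The architecture text extracted further down this page describes an output layer of 33375 values limited to the range 0 to 1, consistent with normalised spectra, so the network acts as a fast emulator of synthetic spectra. The fitting algorithm is the step that turns an observed spectrum into stellar parameters and abundances.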
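To make the division of labour concrete, the sketch below pairs a tiny emulator with a fit. The layer pattern (three SiLU hidden layers and a sigmoid output) follows the architecture text extracted on this page, which gives an output layer of 33375 values between 0 and 1, consistent with normalised spectra. Everything else is an assumption for illustration: the toy layer sizes, the random weights standing in for trained ones, the chi-square objective, and the coordinate search, which is a stand-in for whatever optimiser the paper actually uses.

```python
import math
import random

def silu(x):
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_layer(n_in, n_out, rng):
    # Random weights stand in for trained ones (illustration only).
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(layer, x, activation):
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def emulate(layers, labels):
    """Map scaled labels to a normalised spectrum with values in (0, 1)."""
    hidden = labels
    for layer in layers[:-1]:
        hidden = dense(layer, hidden, silu)
    return dense(layers[-1], hidden, sigmoid)

def chi2(layers, labels, flux, sigma):
    model = emulate(layers, labels)
    return sum(((m - f) / sigma) ** 2 for m, f in zip(model, flux))

def fit(layers, flux, sigma, start, step=0.25, n_rounds=40):
    """Coordinate search: try +/- step on each label, halve the step when stuck."""
    best = list(start)
    best_chi2 = chi2(layers, best, flux, sigma)
    for _ in range(n_rounds):
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                trial = list(best)
                trial[i] += delta
                trial_chi2 = chi2(layers, trial, flux, sigma)
                if trial_chi2 < best_chi2:
                    best, best_chi2, improved = trial, trial_chi2, True
        if not improved:
            step /= 2.0
    return best, best_chi2

rng = random.Random(0)
n_labels, n_hidden, n_pixels = 4, 16, 50  # toy sizes, not the paper's
sizes = [n_labels, n_hidden, n_hidden, n_hidden, n_pixels]
layers = [make_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]

true_labels = [0.3, -0.2, 0.5, 0.1]       # hypothetical scaled labels
observed = emulate(layers, true_labels)   # noiseless mock observation
start = [0.0] * n_labels
start_chi2 = chi2(layers, start, observed, 0.01)
fitted, fitted_chi2 = fit(layers, observed, sigma=0.01, start=start)
print(f"chi2 went from {start_chi2:.1f} to {fitted_chi2:.3g}")
```

The point of the design is that the expensive radiative-transfer step is paid once, during training, while each fit only needs cheap forward passes through the network.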

If this is right

The pipeline can process the full four million stars in the 4MIDABLE-HR survey with quantified uncertainties.
Multiple-element abundance patterns from the survey will be directly comparable to OMEGA+ galactic chemical evolution predictions.
The low bias and spread values establish the precision level expected for 4MOST high-resolution data.
Trends recovered across 18 elements will help constrain the formation and enrichment history of the Milky Way disc and bulge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar networks could be retrained for other upcoming surveys that reach comparable resolution and signal-to-noise.
The method opens the possibility of mapping subtle abundance variations across large stellar samples to identify specific nucleosynthetic sites.
Extending the validation to a wider metallicity or temperature range would strengthen claims about applicability to the oldest stars.

Load-bearing premise

The 404793 training spectra computed in NLTE accurately represent the full range of stars and conditions present in the 4MIDABLE-HR survey targets.

What would settle it

Measuring the same 121 or a larger set of observed stars with an independent high-resolution analysis code or with spectra taken at higher resolution than R=20000 and checking whether the abundance differences remain below 0.1 dex would test the claimed accuracy.

Figures

Figures reproduced from arXiv: 2512.15888 by Georges Kordopatis, Gražina Tautvaišienė, Gregor Traven, Guillaume Guiglion, Maria Bergemann, Mingjie Jian, Nicholas Storm, Ross P. Church, Thomas Bensby, Tomasz Różański, Victor F. Ksoll, Weijia Sun.

**Figure 1.** Distribution in a 2D histogram of synthetic spectra used to train the Payne. Top left panel shows the Kiel diagram, while the rest are distributions as a function of [Fe/H]. All abundances are chosen uniformly random in metallicity space, except for A(O) < 8.87 and A(C) < 8.7. There are less low metallicity giant model atmospheres, resulting in slightly less spectra at low metallicities. There are also no p…

**Figure 2.** Kiel diagram of the fitted stellar sample with [Fe/H] in colour with PARSEC evolutionary tracks (A. Bressan et al. 2012) in colour. Our calibration stars were selected from a sample of benchmark stars (U. Heiter et al. 2015), and we also included spectra of nearby bright stars from our previous studies in K. Fuhrmann et al. (1993); K. Fuhrmann (1998); T. Gehren et al. (2004, 2006); M. Bergemann & T. Gehre…

**Figure 3.** Payne fit (red lines) to the HD 140283 and HD 84937 UVES spectra (black dots), degraded to R ≈ 20000 resolution, in all three 4MOST-HR windows. The subplots are zoom-ins to different regions of the spectra.

**Figure 4.** Comparison between the derived abundance from Payne and TSFitPy plotted in absolute A(X) units. The green line is a one-to-one comparison. Each subpanel shows the average difference (bias) and standard deviation (std) when comparing the abundances from the two sources.

**Figure 5.** Estimated systematic error for stellar parameters and abundances. The error was estimated by taking spread of the difference between the derived parameter from Payne and literature (for Teff and log g for benchmark stars from U. Heiter et al. (2015)) or TSFitPy.

**Figure 6.** Chemical abundances derived using the new NLTE Payne ANN (black points) from the observed archival data of Gaia-ESO benchmark and nearby bright stars. The archival spectra are described in Sect. 2.4. OMEGA+ GCE models are overplotted with three distinct star formation efficiencies ϵ∗: 0.01 (solid black), 0.03 (solid red) and 1 (dashed blue).

**Figure 7.** Lithium abundance as a function of [Fe/H]. The GCE models from B. D. Fields & K. A. Olive (1999a,b) for different production channels are plotted, with the black line showing the total value. The solar fitted value is marked as a red star.

Original abstract

We present the 4MOST-HR resolution Non-Local Thermal Equilibrium (NLTE) Payne artificial neural network (ANN), trained on 404793 new FGK spectra with 16 elements computed in NLTE. This network will be part of the Stellar Abundances and atmospheric Parameters Pipeline (SAPP), which will analyse 4 million stars during the five year long 4MOST consortium 4: 4MOST MIlky way Disc And BuLgE High-Resolution (4MIDABLE-HR) survey. A fitting algorithm using this ANN is also presented that is able to fully-automatically and self-consistently derive both stellar parameters and elemental abundances. The ANN is validated by fitting 121 observed spectra of low-mass FGKM type stars, including main-sequence dwarf, subgiant and giant stars down to [Fe/H] ≈ -3.3 degraded to 4MOST-HR resolution of R ≈ 20000, and comparing the derived abundances with the output of the classical radiative transfer code TSFitPy. We are able to recover all 18 elemental abundances with a bias < 0.13 and spread < 0.16 dex, although the typical values are < 0.09 dex for most elements. These abundances are compared to the OMEGA+ Galactic Chemical Evolution model, showcasing for the first time, the expected performance and results obtained from high-resolution spectra of the quality expected to be obtained with 4MOST. The expected Galactic trends are recovered, and we highlight the potential of using many chemical elements to constrain the formation history of the Galaxy.
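The abstract mentions degrading observed spectra to R ≈ 20000 without describing the procedure on this page. One common way to do it, shown here only as a sketch, is to convolve with a Gaussian whose width is the quadrature difference between the target and native resolution elements. The Gaussian line-spread function, the constant kernel width across the window, the uniform wavelength grid, the native resolution of 47000 and the mock spectrum are all assumptions for illustration, not values taken from the paper.

```python
import math

def degrade_fwhm(wavelength, r_in, r_out):
    """FWHM of the Gaussian that takes resolution r_in down to r_out."""
    if r_out >= r_in:
        raise ValueError("target resolution must be lower than the input one")
    return wavelength * math.sqrt(1.0 / r_out**2 - 1.0 / r_in**2)

def gaussian_kernel(sigma_pix, half_width=4.0):
    """Normalised Gaussian kernel truncated at half_width * sigma."""
    n = max(1, int(math.ceil(half_width * sigma_pix)))
    kernel = [math.exp(-0.5 * (i / sigma_pix) ** 2) for i in range(-n, n + 1)]
    total = sum(kernel)
    return [k / total for k in kernel]

def degrade(flux, step, wavelength, r_in, r_out):
    """Convolve a uniformly sampled spectrum (grid spacing `step`) to r_out."""
    fwhm = degrade_fwhm(wavelength, r_in, r_out)
    sigma_pix = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0))) / step
    kernel = gaussian_kernel(sigma_pix)
    half = len(kernel) // 2
    out = []
    for i in range(len(flux)):
        acc = 0.0
        for j, k in enumerate(kernel):
            # Clamp indices at the edges so the continuum is extended outward.
            idx = min(max(i + j - half, 0), len(flux) - 1)
            acc += k * flux[idx]
        out.append(acc)
    return out

# Mock spectrum: flat continuum with one narrow absorption line.
flux = [1.0] * 201
flux[100] = 0.2
degraded = degrade(flux, step=0.01, wavelength=6000.0, r_in=47000.0, r_out=20000.0)

absorbed_before = sum(1.0 - f for f in flux)
absorbed_after = sum(1.0 - f for f in degraded)
print(f"line depth: {1 - min(flux):.2f} -> {1 - min(degraded):.3f}")
```

Because the kernel is normalised, the total absorbed flux of a line well inside the window is unchanged; only its depth and width trade off against each other.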

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A new NLTE-trained ANN and fitting pipeline for 4MOST-HR spectra that delivers usable abundance precision on validation data, though scaling claims to 4 million stars rest on limited coverage checks.

The core advance is a Payne-style ANN trained on 404793 NLTE synthetic spectra at 4MOST-HR resolution, paired with an automatic fitting routine that returns both stellar parameters and 18 elemental abundances in a single pass. That combination is new for this survey setup and resolution. The validation step compares the network output on 121 degraded observed spectra against TSFitPy and reports biases below 0.13 dex and spreads below 0.16 dex, with typical values under 0.09 dex for most elements, which is respectable for an automated method. The authors also show that the derived abundances trace the Galactic trends expected from the OMEGA+ model, which gives a first look at what the full 4MIDABLE-HR sample might deliver. The work is straightforward and the numbers are stated clearly.

The soft spot is the jump from 121 validation stars to the full 4 million targets. The training set distribution is not shown to match the joint range of Teff, log g, metallicity, and S/N that the survey will actually contain, and no ablation tests are reported on the tails, such as [Fe/H] below -2 or low-S/N giants. If those regions are undersampled in the validation set, the quoted performance numbers may not hold uniformly.

The paper is aimed at teams running large spectroscopic surveys and at galactic archaeology groups who need a fast, self-consistent abundance pipeline. Readers working on 4MOST data reduction or chemical tagging will find the concrete metrics and the comparison to TSFitPy useful. It is solid enough on the methods side to merit a serious referee, even if the generalization argument needs tightening. I would send it out for review rather than desk reject.

Referee Report

1 major / 1 minor

Summary. The paper presents an NLTE Payne ANN trained on 404793 synthetic FGK spectra (16 elements) for the SAPP pipeline to derive stellar parameters and 18 elemental abundances from 4MOST-HR spectra. A fitting algorithm is described that is validated on 121 observed low-mass FGKM spectra (dwarfs to giants, down to [Fe/H]≈-3.3) degraded to R≈20000, yielding biases <0.13 dex and spreads <0.16 dex (typically <0.09 dex) versus TSFitPy; the derived abundances are then compared to OMEGA+ GCE models to illustrate expected survey performance and recovered galactic trends.

Significance. If the reported generalization holds, the work is significant for enabling scalable, NLTE-consistent abundance analysis across millions of stars in 4MOST and similar surveys. It combines machine learning with detailed radiative transfer to address the volume of upcoming high-resolution data, and the direct comparison to GCE models provides a concrete demonstration of how such abundances can constrain galactic formation history.

major comments (1)

[Validation on observed spectra] Validation section (and abstract claim of 'expected performance'): the bias <0.13 dex and spread <0.16 dex metrics are obtained exclusively from 121 degraded observed spectra. No quantitative comparison of the joint (Teff, log g, [Fe/H], S/N) distribution between the 404793 training spectra and the full 4MIDABLE-HR target sample is provided, nor any ablation or coverage test for extrapolation in the tails (e.g., [Fe/H]<-2 or low-S/N giants). This directly underpins the central claim that the metrics represent expected performance on 4 million stars.
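One simple form the requested coverage test could take is sketched below: bin the training labels into cells of (Teff, log g, [Fe/H]) and report the fraction of target stars that land in a cell containing at least one training spectrum. This is a hypothetical check, not something the paper reports. The cell widths, the function names and all label values are invented for illustration, and S/N is left out for brevity.

```python
import math

def cell(labels, widths):
    """Index of the grid cell that a label tuple falls into."""
    return tuple(math.floor(x / w) for x, w in zip(labels, widths))

def coverage_fraction(training, targets, widths):
    """Fraction of targets lying in a cell occupied by training data."""
    occupied = {cell(t, widths) for t in training}
    inside = sum(1 for t in targets if cell(t, widths) in occupied)
    return inside / len(targets)

# Hypothetical (Teff [K], log g, [Fe/H]) tuples.
training = [(5770, 4.4, 0.1), (4800, 2.5, -0.6), (6200, 4.1, -1.2)]
targets = [
    (5800, 4.3, 0.2),    # solar-like dwarf, covered
    (4850, 2.6, -0.7),   # giant, covered
    (4500, 1.5, -2.8),   # metal-poor giant, not covered
    (6100, 4.2, -1.1),   # metal-poor turn-off star, covered
]
widths = (250.0, 0.5, 0.5)  # assumed cell sizes in each label

covered = coverage_fraction(training, targets, widths)
print(f"{covered:.0%} of targets fall in cells sampled by the training set")
```

A low fraction in some region of the Kiel diagram would flag exactly the extrapolation risk the comment describes, for example for metal-poor giants.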

minor comments (1)

Clarify the exact set of 18 elements recovered versus the 16 used in training, and whether any post-processing or additional lines are involved.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation of the work's significance and for the detailed comment on validation. We address the concern below and will revise the manuscript to strengthen the support for the claimed expected performance on the 4MIDABLE-HR sample.

Referee: Validation section (and abstract claim of 'expected performance'): the bias <0.13 dex and spread <0.16 dex metrics are obtained exclusively from 121 degraded observed spectra. No quantitative comparison of the joint (Teff, log g, [Fe/H], S/N) distribution between the 404793 training spectra and the full 4MIDABLE-HR target sample is provided, nor any ablation or coverage test for extrapolation in the tails (e.g., [Fe/H]<-2 or low-S/N giants). This directly underpins the central claim that the metrics represent expected performance on 4 million stars.

Authors: We agree that an explicit comparison of the joint parameter distributions and targeted tests for the tails would better substantiate the generalization to the full survey. The training grid was constructed to cover the expected FGK parameter space for 4MIDABLE-HR (including [Fe/H] down to -3), and the 121 validation spectra already reach [Fe/H] ≈ -3.3 across dwarfs to giants, but we acknowledge the absence of a direct quantitative overlay or ablation study. In the revised manuscript we will add a new figure (or expanded panel) showing the joint (Teff, log g, [Fe/H], S/N) distributions for the training set, the validation set, and the anticipated 4MIDABLE-HR target distribution derived from the survey selection function. We will also include performance metrics stratified by metallicity bins (explicitly for [Fe/H] < -2) and by S/N and luminosity class. These additions will appear in the validation section and will be referenced in the abstract.

Revision: yes
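The stratified metrics promised in this response could be computed along the following lines. This is a sketch only: the bin edges, the function name and the sample values are hypothetical, and the sample standard deviation is an assumed choice of estimator.

```python
from statistics import mean, stdev

def stratified_bias_spread(feh, diffs, edges):
    """Group abundance differences by [Fe/H] bin.

    Returns {(lo, hi): (bias, spread, n)} for every bin lo <= [Fe/H] < hi
    that contains at least one star. Spread is NaN for single-star bins.
    """
    groups = {}
    for value, diff in zip(feh, diffs):
        for lo, hi in zip(edges[:-1], edges[1:]):
            if lo <= value < hi:
                groups.setdefault((lo, hi), []).append(diff)
                break
    return {
        key: (mean(vals), stdev(vals) if len(vals) > 1 else float("nan"), len(vals))
        for key, vals in groups.items()
    }

# Hypothetical [Fe/H] values and Payne - TSFitPy differences (dex).
feh = [-3.1, -2.4, -1.5, -1.2, -0.3, 0.0, 0.2]
diffs = [0.10, 0.06, 0.03, -0.01, 0.02, -0.02, 0.00]
edges = [-4.0, -2.0, -1.0, 0.5]  # assumed bins, with [Fe/H] < -2 on its own

metrics = stratified_bias_spread(feh, diffs, edges)
for (lo, hi), (bias, spread, n) in sorted(metrics.items()):
    print(f"[{lo:+.1f}, {hi:+.1f}): n={n}, bias={bias:+.3f}, spread={spread:.3f}")
```

Reporting the star count per bin alongside the bias and spread matters here, because a bin with only a handful of stars gives a very uncertain spread.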

Circularity Check

0 steps flagged

Validation on independent observed spectra and separate radiative transfer code prevents reduction of performance claims to training inputs

full rationale

The central derivation trains an ANN on 404793 independently computed NLTE synthetic spectra, then measures bias and spread exclusively by comparing ANN-derived abundances on 121 real observed spectra against the separate TSFitPy code. This validation step is external to the training set and does not reduce the reported metrics to fitted parameters by construction. No self-citation chain, ansatz smuggling, or uniqueness theorem is invoked to justify the core performance numbers or the generalization claim. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the accuracy of the pre-computed NLTE training spectra and the assumption that the ANN generalizes from 121 validation stars to the full survey. No new physical entities are introduced.

free parameters (1)

ANN network weights and biases
Fitted during training on the 404793 spectra; central to the method but standard for neural networks.
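For a sense of scale, the number of weights and biases can be counted from the architecture text extracted on this page: three fully-connected hidden layers of 1024 neurons each and an output layer of size 33375. The number of input labels is not stated here, so the value of 20 used below is purely a placeholder assumption.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out

def mlp_params(n_labels, hidden=(1024, 1024, 1024), n_pixels=33375):
    """Total trainable parameters of a plain multilayer perceptron."""
    sizes = (n_labels, *hidden, n_pixels)
    return sum(dense_params(a, b) for a, b in zip(sizes[:-1], sizes[1:]))

n_labels = 20  # placeholder: the input size is not given on this page
total = mlp_params(n_labels)
output_layer = dense_params(1024, 33375)
print(f"total parameters: {total:,}")
print(f"output layer share: {output_layer / total:.1%}")
```

Under these assumptions the output layer dominates the count, so the total is only weakly sensitive to the guessed number of input labels.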

axioms (1)

domain assumption: NLTE radiative transfer calculations used for training spectra are sufficiently accurate representations of real stellar atmospheres
Invoked when generating the training set of 404793 spectra.

pith-pipeline@v0.9.0 · 5688 in / 1268 out tokens · 22473 ms · 2026-05-16T21:16:47.976703+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.

We present the 4MOST-HR resolution Non-Local Thermal Equilibrium (NLTE) Payne artificial neural network (ANN), trained on 404793 new FGK spectra with 16 elements computed in NLTE... A fitting algorithm using this ANN is also presented that is able to fully-automatically and self-consistently derive both stellar parameters and elemental abundances.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.

These abundances are compared to the OMEGA+ Galactic Chemical Evolution model, showcasing for the first time, the expected performance and results obtained from high-resolution spectra...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

[1]

Amarsi, A. M., Asplund, M., Collet, R., & Leenaarts, J. 2015, MNRAS, 454, L11, doi: 10.1093/mnrasl/slv122
Arcones, A., & Thielemann, F.-K. 2023, A&A Rv, 31, 1, doi: 10.1007/s00159-022-00146-x
Arnould, M., Goriely, S., & Takahashi, K. 2007, PhR, 450, 97, doi: 10.1016/j.physrep.2007.06.002
Barbuy, B., Chiappini, C., & Gerhard, O. 2018, ARA&A, 56, 223, doi: ...

[2]

and consists of 3 fully-connected hidden layers of 1024 neurons each, using Sigmoid Linear Unit (SiLU) activation functions, and a final output layer of size 33375 with a sigmoid activation function. The latter limits the output to within 0 to 1, consistent with the range of normalised spectra. The network size was chosen as a balance between complexi...

[3]

A too low or too high initial learning rate can also result in a suboptimal training convergence. For our network, reducing the number of training steps or training spectra by half had the smallest impact. Out of the final training set of 404 793 spectra, 6% were used for the validation set. The individual abundance values were chosen in a uniformly ran...
