Performance analysis of extragalactic classifications in Gaia Data Release 4

Coryn A. L. Bailer-Jones; Orlagh L. Creevey; Ruth Carballo; Sara Jamal

arxiv: 2605.23388 · v2 · pith:4NLTUMS5new · submitted 2026-05-22 · 🌌 astro-ph.GA · astro-ph.IM

Pith reviewed 2026-05-25 04:13 UTC · model grok-4.3

keywords Gaia DR4 · quasars · galaxies · classification · completeness · purity · extragalactic sources · infrared photometry

The pith

Gaia DR4's best classifiers reach at least 88% completeness and 96% purity for quasars and galaxies brighter than G=20.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates the Discrete Source Classifier (DSC) in Gaia Data Release 4. DSC uses neural networks on astrometry, photometry, and low-resolution spectra to assign sources to quasar, galaxy, or star classes. The paper reports that the best DSC versions improve purity over earlier releases with only a minor loss of completeness when tested on data away from the Magellanic Clouds. Performance stays high for bright sources but drops at fainter magnitudes because of increased noise. Adding mid-infrared photometry from CatWISE2020 raises completeness at G greater than 20 by 9 to 29 percentage points, though purity falls by 1 to 9 points. The work also gives expected counts of selected objects in the full release and recommends Gaia data-quality cuts to lower contamination.

Core claim

When evaluated on a test set excluding the Magellanic Clouds, the best DSC classifiers in Gaia DR4 deliver at least 88% completeness and 96% purity for quasars and galaxies at G less than 20. Performance falls at fainter magnitudes, reaching minima of 55% completeness and 71% purity between G=20 and 20.5. Models that incorporate CatWISE2020 mid-infrared photometry recover 9 to 29 percentage points more completeness at G greater than 20, though purity drops by 1 to 9 points. The completeness-prioritizing combined classifier selects three million quasars and two million galaxies, while the purity-prioritizing ones select two million quasars and 1.3 million galaxies with lower contamination.
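Completeness and purity are read here in the standard classification sense: completeness is the fraction of true members of a class that the classifier recovers, and purity is the fraction of sources assigned to a class that truly belong to it. The sketch below computes both from plain label lists. The function name and the toy labels are illustrative assumptions, and the paper may apply additional weighting (for example by class prior) that is not reproduced here.

```python
def completeness_purity(true_labels, predicted_labels, target):
    """Return (completeness, purity) for one target class.

    completeness = TP / (TP + FN), purity = TP / (TP + FP).
    A metric whose denominator is zero is returned as None.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    completeness = tp / (tp + fn) if (tp + fn) else None
    purity = tp / (tp + fp) if (tp + fp) else None
    return completeness, purity


# Toy labels, invented purely for illustration (not data from the paper).
truth = ["quasar", "quasar", "quasar", "star", "star", "galaxy"]
predicted = ["quasar", "quasar", "star", "quasar", "star", "galaxy"]
quasar_metrics = completeness_purity(truth, predicted, "quasar")
galaxy_metrics = completeness_purity(truth, predicted, "galaxy")
print(quasar_metrics, galaxy_metrics)
```

In the toy data one true quasar is missed and one star is mislabelled as a quasar, so both quasar metrics come out at two thirds.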

What carries the argument

The Discrete Source Classifier (DSC), which combines outputs from three neural networks trained on Gaia astrometry, photometry, and XP spectra, optionally with CatWISE2020 mid-infrared photometry, to produce class probabilities.
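The text above does not spell out how the network outputs are combined. One common rule for merging posteriors from two classifiers that share a class prior, assuming their inputs are conditionally independent given the class, is to multiply the posteriors, divide by the prior once, and renormalise. The sketch below illustrates that rule only. The function name, the prior, and the probabilities are hypothetical, and this should not be read as DSC's actual combination scheme.

```python
def combine_posteriors(post_a, post_b, prior):
    """Combine two posteriors defined over the same classes.

    Assumes the two classifiers' inputs are conditionally independent
    given the class, so that
        P(c | a, b) is proportional to P(c | a) * P(c | b) / P(c).
    """
    unnormalised = {c: post_a[c] * post_b[c] / prior[c] for c in prior}
    total = sum(unnormalised.values())
    return {c: value / total for c, value in unnormalised.items()}


# Hypothetical prior and per-classifier posteriors for a single source.
prior = {"quasar": 0.001, "galaxy": 0.001, "star": 0.998}
spectral = {"quasar": 0.60, "galaxy": 0.05, "star": 0.35}
astrometric = {"quasar": 0.40, "galaxy": 0.10, "star": 0.50}

combined = combine_posteriors(spectral, astrometric, prior)
best_class = max(combined, key=combined.get)
print(best_class, combined)
```

Dividing by the prior matters because each input posterior already includes it once; without the division, rare classes would be penalised twice.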

If this is right

The completeness-prioritizing DSC identifies three million quasars and two million galaxies in GDR4.
The purity-prioritizing DSC versions identify two million quasars and 1.3 million galaxies with lower expected contamination.
Inclusion of CatWISE2020 infrared photometry increases faint-end completeness for extragalactic classes.
Applying quality cuts to Gaia photometry and astrometry can raise the purity of extragalactic selections.
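As a rough illustration of how such cuts might be applied to a candidate list, the sketch below filters on the limits named in the Figure 8 caption (G<20.5, |sin b|>0.20, and a minimum of ten photometric observations). The dictionary keys are hypothetical rather than Gaia archive column names, and the astrometric criterion is left out because the text here does not state it.

```python
import math


def passes_cuts(source, min_phot_obs=10, g_max=20.5, min_abs_sin_b=0.20):
    """Return True if a candidate passes the illustrative selection.

    `source` is a dict with hypothetical keys:
      g_mag      - G-band magnitude
      b_deg      - Galactic latitude in degrees
      n_phot_obs - number of photometric observations
    """
    abs_sin_b = abs(math.sin(math.radians(source["b_deg"])))
    return (
        source["n_phot_obs"] >= min_phot_obs
        and source["g_mag"] < g_max
        and abs_sin_b > min_abs_sin_b
    )


# Invented candidates, for illustration only.
candidates = [
    {"id": "A", "g_mag": 19.2, "b_deg": 45.0, "n_phot_obs": 25},
    {"id": "B", "g_mag": 20.1, "b_deg": 5.0, "n_phot_obs": 30},   # too close to the plane
    {"id": "C", "g_mag": 20.3, "b_deg": -60.0, "n_phot_obs": 6},  # too few observations
    {"id": "D", "g_mag": 20.8, "b_deg": 70.0, "n_phot_obs": 40},  # too faint
]
kept = [s["id"] for s in candidates if passes_cuts(s)]
print(kept)
```

Keeping the thresholds as keyword arguments makes it easy to tighten or relax one cut and see how the surviving sample changes.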

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reported performance numbers suggest Gaia DR4 will supply the largest all-sky extragalactic samples yet available for population studies.
Position-dependent variations not captured in the test set could affect uniformity in large-scale structure analyses.
The approach of supplementing optical data with mid-infrared photometry could be tested on other surveys to check for similar completeness gains.

Load-bearing premise

The test set excluding the Magellanic Clouds provides an unbiased estimate of classifier performance as a function of brightness and sky position for the full Gaia survey.

What would settle it

A spectroscopic follow-up of a large random sample of sources classified as quasars or galaxies at G around 20 would directly measure whether the reported completeness and purity values hold.
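To get a feel for how large such a follow-up sample would need to be, one can attach a binomial confidence interval to the measured purity. The sketch below uses the Wilson score interval, which is a generic statistical choice made here and not a method taken from the paper. The sample size and confirmation count are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximately 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z
        * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2))
        / denom
    )
    return centre - half_width, centre + half_width


# Hypothetical follow-up: 1000 candidates observed, 710 spectroscopically confirmed.
low, high = wilson_interval(710, 1000)
print(f"purity = 0.710, approx. 95% interval = [{low:.3f}, {high:.3f}]")
```

With a thousand spectra the interval is a few percentage points wide, which is about the resolution needed to test purity values quoted to the nearest percent.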

Figures

Figures reproduced from arXiv: 2605.23388 by Coryn A. L. Bailer-Jones, Orlagh L. Creevey, Ruth Carballo, Sara Jamal.

**Figure 1.** Representation in 2D of the distribution of sources as a function of Galactic latitude and brightness of the training data set. Each distribution is normalised by the total number of sources in that panel. The colour scale is set such that brighter colours refer to regions of higher density compared to darker colours, which have fewer sources. The 2D representation of the density of sources is defined on a…

**Figure 2.** Top: Variation in the completeness and the purity of each class in the test set as a function of the posterior probability threshold used to assign a class. Bottom: Variation in the proportion of the unclassified sources with probabilities below the threshold, relative to the total number of sources in each class. The y-axis is limited to 0.5 for visualisation purposes only. …

**Figure 3.** DSC classification performance on the test data set as a function of source Galactic latitude and G-magnitude. Predicted labels are obtained by selecting the class with the highest posterior probability, using the global prior. Left to right, the completenesses and the purities of the quasar and galaxy classes. Rows (a), (b), (c), (d) and (e) refer to Specmod, Allosmod, Specmod-Supp, Combmod, and Combmod-α…

**Figure 4.** Variation in performance as a function of magnitude for sources at high latitudes as a function of absolute Galactic latitude for two brightness ranges (middle and right) for each class (rows). The one-dimensional plots are a marginalisation of the two-dimensional representations in …

**Figure 5.** Galactic sky distribution of sources at magnitudes G<20.5 classified from maximum probabilities by DSC combined classifiers using the global prior at HEALpixel level 7 in Mollweide projection. Top: quasars. Bottom: galaxies. The colour map uses bright colours for high-density regions, while darker colours refer to regions with fewer observations. The LMC and SMC regions are masked in grey. Combmod identif…

**Figure 6.** Same as …

**Figure 7.** Colour-colour distributions of sources classified by Combmod-α using the global prior at magnitudes of G<20.5. Left: quasars. Right: galaxies. Sources located in the LMC and SMC are excluded. Contours (cyan lines) show the normalised density on a log scale of the highest-density regions for the trainset. (a) 1 313 150 quasars and 608 781 galaxies identified across the whole sky. (b) 1 234 753 quasars and 5…

**Figure 8.** Feature distributions of extragalactic candidates identified in the Gaia only mode by Combmod-α using the global prior, at magnitudes G<20.5 and higher latitudes |sin b|>0.20. Sources located in the LMC and SMC are excluded. (a) Quasar candidates. (b) Galaxy candidates. The quality cut is applied to Gaia photometry and astrometry. The photometric quality cut requires a minimum of ten photometric observatio…
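The threshold behaviour described in the Figure 2 caption can be sketched in a few lines: raising the probability threshold for a class typically trades completeness for purity. The version below is a simplification that thresholds a single class probability on its own instead of reproducing the full class-assignment rule, and the function name and toy probabilities are assumptions made for illustration.

```python
def sweep_threshold(true_labels, class_probs, target, thresholds):
    """Completeness and purity of `target` versus probability threshold.

    A source is assigned to `target` only when its probability for that
    class is at or above the threshold. A metric with a zero denominator
    is reported as None.
    """
    n_true = sum(1 for t in true_labels if t == target)
    rows = []
    for thr in thresholds:
        selected = [t for t, p in zip(true_labels, class_probs) if p >= thr]
        tp = sum(1 for t in selected if t == target)
        completeness = tp / n_true if n_true else None
        purity = tp / len(selected) if selected else None
        rows.append((thr, completeness, purity))
    return rows


# Toy quasar probabilities, invented for illustration.
truth = ["quasar", "quasar", "quasar", "quasar", "star", "star", "star", "galaxy"]
p_quasar = [0.99, 0.90, 0.70, 0.40, 0.80, 0.30, 0.05, 0.55]
results = sweep_threshold(truth, p_quasar, "quasar", [0.5, 0.75, 0.95])
for thr, comp, pur in results:
    print(thr, comp, pur)
```

In this toy case, completeness falls from 0.75 to 0.25 as the threshold rises from 0.5 to 0.95, while purity climbs from 0.6 to 1.0.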

Original abstract

The Discrete Source Classifier (DSC) provides probabilistic classifications of sources in Gaia Data Release 4 (GDR4) based on empirically-trained Bayesian classifiers. Using Gaia astrometry, photometry, and low-resolution spectra (XP), DSC classifies all sources as quasars, galaxies, or stars. DSC comprises three trained neural networks and three combinations of their probabilities. When evaluated as a function of brightness and sky position on a test set excluding the Magellanic Clouds, the DSC purity in GDR4 has improved for a small loss in completeness. The average performance of the best classifiers at magnitudes brighter than G=20 is at least 88% completeness and 96% purity for the extragalactic classes, namely the quasar and galaxy classes. At fainter magnitudes, performance is lower due to increased noise. The average performance at magnitudes of 20≤G<20.5 is a minimum of 55% completeness and 71% purity for the extragalactic classes. At G>20.5 mag, completeness is considerably reduced, primarily for the models that depend on the XP spectra. Furthermore, we train additional models on Gaia optical data together with mid-infrared photometry from the CatWISE2020 catalogue. Inclusion of infrared photometry increases the completeness of extragalactic samples at G>20 mag between 9 and 29 percentage points, at the cost of reducing purity between 1 and 9 percentage points. In GDR4, the best DSC-combined classifier prioritising completeness identifies three million quasars and two million galaxies, but with expected high contamination among fainter sources. In contrast, the combined classifiers prioritising purity identify approximately two million quasars and 1.3 million galaxies with an expected lower level of contamination. Finally, we provide recommendations for enhancing the purity of the DSC extragalactic selection by applying quality cuts to the Gaia photometry and astrometry.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives updated DR4 performance numbers for Gaia's extragalactic classifier and shows that adding CatWISE mid-IR data lifts completeness at the faint end.

The main point is that this is a straightforward empirical update on the Discrete Source Classifier for Gaia DR4. It reports average completeness of at least 88% and purity of at least 96% for quasars and galaxies at G<20 on the test set, with purity slightly higher than in earlier releases at a modest cost to completeness. Adding CatWISE2020 photometry raises completeness by 9-29 percentage points at G>20 while cutting purity by 1-9 points. The paper also gives approximate full-catalog source counts for the different classifier combinations and suggests quality cuts on photometry and astrometry to clean the samples further.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates the performance of the Discrete Source Classifier (DSC) in Gaia Data Release 4 for classifying sources as quasars, galaxies, or stars using astrometry, photometry, and XP spectra. It reports completeness and purity on a test set excluding the Magellanic Clouds as functions of G magnitude and sky position, claims improvements in purity with minor completeness loss relative to prior releases, evaluates additional models augmented with CatWISE2020 mid-IR photometry, provides estimated counts of selected extragalactic sources in GDR4, and recommends quality cuts on Gaia data to enhance purity.

Significance. If the performance metrics generalize, the work supplies practical validation and usage guidance for large extragalactic samples from Gaia DR4, including the impact of IR augmentation and concrete sample-size estimates. The empirical evaluation on held-out test data is a clear strength.

major comments (2)

[Abstract] Abstract: The central claim of average performance at least 88% completeness and 96% purity for extragalactic classes at G<20 (and the fainter-magnitude figures) is derived from the test set after Magellanic Cloud exclusion. The manuscript states that performance is evaluated as a function of brightness and position, but provides no quantitative comparison of the remaining sky coverage, extragalactic fraction, local density, or XP spectrum quality against the full Gaia survey, which is required to establish that the reported averages are unbiased for the survey as a whole.
[Abstract] Abstract and the section describing the DSC training and test-set construction: The quantitative completeness/purity figures are presented without accompanying details on training procedures, data splits, cross-validation strategy, or uncertainty quantification (e.g., bootstrap or binomial errors on the reported percentages). This information is load-bearing for assessing the support for the stated performance numbers.

minor comments (2)

[Abstract] The abstract uses the phrasing 'a minimum of 55% completeness and 71% purity' for the 20≤G<20.5 bin; explicit per-class or per-classifier breakdowns in a table would improve clarity.
The description of the three neural networks and their probability combinations would benefit from a concise schematic or table listing input features and output combinations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments both concern the presentation of performance metrics in the abstract and supporting sections. We address each below and will revise the manuscript accordingly to improve clarity and support for the claims.

Referee: [Abstract] Abstract: The central claim of average performance at least 88% completeness and 96% purity for extragalactic classes at G<20 (and the fainter-magnitude figures) is derived from the test set after Magellanic Cloud exclusion. The manuscript states that performance is evaluated as a function of brightness and position, but provides no quantitative comparison of the remaining sky coverage, extragalactic fraction, local density, or XP spectrum quality against the full Gaia survey, which is required to establish that the reported averages are unbiased for the survey as a whole.

Authors: We agree that a direct quantitative comparison between the test set (after Magellanic Cloud exclusion) and the full Gaia survey would strengthen the claim that the reported averages are representative. The exclusion was made to avoid regions with atypical stellar densities and training biases, and performance is already shown versus sky position, but we did not include explicit metrics such as fractional sky coverage, extragalactic source fraction, or median XP quality. In revision we will add a short table or paragraph in the methods or results section providing these comparisons (e.g., fraction of sky retained, G-magnitude histograms, and average XP SNR).
Revision: yes
Referee: [Abstract] Abstract and the section describing the DSC training and test-set construction: The quantitative completeness/purity figures are presented without accompanying details on training procedures, data splits, cross-validation strategy, or uncertainty quantification (e.g., bootstrap or binomial errors on the reported percentages). This information is load-bearing for assessing the support for the stated performance numbers.

Authors: We acknowledge that the abstract and the DSC section currently lack explicit statements on train/test splits, cross-validation, and uncertainty on the quoted percentages. The manuscript describes the classifier architecture and the held-out test set but does not detail the split ratios or error estimation. In the revised version we will expand the relevant section to specify the data partitioning, any cross-validation used during training, and will attach binomial or bootstrap uncertainties to the completeness and purity values reported in the abstract and main text.
Revision: yes
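For readers unfamiliar with the bootstrap uncertainties mentioned in this exchange, the sketch below shows a percentile bootstrap on a completeness estimate. It is a generic illustration with invented inputs, not the procedure the authors commit to, and the function name and the 88-of-100 toy sample are assumptions.

```python
import random


def bootstrap_completeness(recovered_flags, n_resamples=2000, seed=0):
    """Percentile bootstrap (2.5%, 97.5%) for a completeness estimate.

    `recovered_flags` holds one boolean per true class member:
    True if the classifier recovered it, False otherwise.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(recovered_flags)
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(recovered_flags, k=n)  # sample with replacement
        estimates.append(sum(resample) / n)
    estimates.sort()
    low = estimates[int(0.025 * n_resamples)]
    high = estimates[int(0.975 * n_resamples) - 1]
    return sum(recovered_flags) / n, low, high


# Toy test set: 100 true quasars, 88 of them recovered (invented numbers).
flags = [True] * 88 + [False] * 12
point, low, high = bootstrap_completeness(flags)
print(f"completeness = {point:.2f}, approx. 95% interval = [{low:.2f}, {high:.2f}]")
```

The interval shrinks roughly as the inverse square root of the number of test sources, so per-bin values in sparsely populated magnitude or latitude bins carry much larger uncertainties than the global averages.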

Circularity Check

0 steps flagged

No circularity: empirical performance metrics on held-out test data

The paper reports completeness and purity metrics obtained by applying pre-trained neural networks (DSC) to a held-out test set after explicit exclusion of the Magellanic Clouds. No equations, derivations, or 'predictions' are presented that reduce to fitted parameters or self-citations by construction. Performance figures are direct empirical counts on the test data as a function of G magnitude and sky position. The additional CatWISE2020 models are likewise trained and evaluated on external photometry without any self-referential loop. Self-citations, if present, are not load-bearing for any claimed derivation. This is a standard empirical classifier evaluation and therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical evaluation of pre-trained neural networks on a held-out test set. No new free parameters are reported in the performance analysis itself.

axioms (1)

domain assumption: The test set excluding the Magellanic Clouds is representative for performance evaluation across the sky.
The abstract specifies this exclusion when reporting brightness- and position-dependent metrics.

pith-pipeline@v0.9.0 · 5894 in / 1281 out tokens · 74124 ms · 2026-05-25T04:13:54.654232+00:00 · methodology
