Photometric classification of quasars from DES and photo-$z$ estimation with Machine Learning

Camila Cardoso; Elcio Abdalla; Filipe B. Abdalla; Gabriel S. Costa; Pablo Motta

arxiv: 2605.18218 · v2 · pith:W3V6TSY4new · submitted 2026-05-18 · 🌌 astro-ph.IM · astro-ph.CO


Pablo Motta, Filipe B. Abdalla, Elcio Abdalla, Gabriel S. Costa, Camila Cardoso

Pith reviewed 2026-05-20 00:20 UTC · model grok-4.3


keywords: quasar classification, photometric redshifts, machine learning, Dark Energy Survey, KNN, cosmology, large-scale structure


The pith

KNN on DES photometry classifies quasars at 0.99 precision and builds an 872k-object photo-z catalog

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper cross-matches DES DR2 photometry with SDSS DR16 spectroscopy to assemble a training set of 168,738 point-like objects. A K-nearest neighbors algorithm applied to PSF magnitudes in the g, r, i, and z bands separates quasars from stars and galaxies at 0.99 precision and 0.77 recall. A hybrid machine-learning pipeline that combines boosted decision trees with a decision-tree regressor then estimates photometric redshifts, yielding a sample of 872,372 objects. After cleaning, 675,683 objects remain suitable for large-scale structure work at redshifts below 3, while the full catalog stays reliable for cosmological use near redshift 4.
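A positional cross-match of this kind can be sketched as follows. This is a minimal illustration, not the authors' procedure: the 1 arcsecond matching radius, the function names, and the toy coordinates are all assumptions, and a real catalog of this size would use a spatial index rather than a brute-force loop.

```python
import math

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600.0

def cross_match(photo, spec, radius_arcsec=1.0):
    """Pair each photometric source with its nearest spectroscopic source
    inside the matching radius. Brute force, O(N*M); radius is an assumed value."""
    matches = []
    for i, (ra_p, dec_p) in enumerate(photo):
        best_j, best_sep = None, radius_arcsec
        for j, (ra_s, dec_s) in enumerate(spec):
            sep = angular_sep_arcsec(ra_p, dec_p, ra_s, dec_s)
            if sep <= best_sep:
                best_j, best_sep = j, sep
        if best_j is not None:
            matches.append((i, best_j, best_sep))
    return matches

# Toy (RA, Dec) positions in degrees, invented for illustration only.
photo = [(10.0000, -1.0000), (10.5000, -1.2000), (11.0000, -0.5000)]
spec = [(10.0001, -1.0001), (11.2000, -0.5000)]
matches = cross_match(photo, spec)
print(matches)
```

Only the first photometric source finds a spectroscopic counterpart here; the other two stay unlabeled, which is the same mechanism that makes the labeled training set a subset of the photometric population.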

Core claim

Cross-matching DES DR2 with SDSS DR16 produces a training set of 168,738 objects on which a KNN classifier using four-band PSF magnitudes separates quasars from contaminants at 0.99 precision with 0.77 recall. A hybrid ML approach combining boosted decision trees and a decision tree regressor then estimates photometric redshifts across 872,372 photometric objects, with 675,683 cleaned objects reliable for cosmological applications in the range 0 < z < 3 and the full set useful at z ≈ 4.

What carries the argument

K-Nearest Neighbors classifier on PSF magnitudes in the g, r, i, z bands for quasar selection, followed by a hybrid boosted decision tree plus decision tree regressor pipeline for photometric redshift estimation
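The classification step can be illustrated with a small K-nearest-neighbors sketch on four-band magnitudes. Everything below is an assumption made for illustration: the class centers, the scatter, the value k=5, and the two-class setup are synthetic and not taken from the paper. The source does not say which KNN implementation the authors used; in practice a library implementation such as scikit-learn's `KNeighborsClassifier` would be the usual choice.

```python
import math
import random
from collections import Counter

def knn_predict(train_X, train_y, x, k=5):
    """Majority label among the k training points nearest to x (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

rng = random.Random(0)

def fake_object(center, scatter=0.1):
    """Draw synthetic (g, r, i, z) magnitudes around a class center."""
    return tuple(rng.gauss(m, scatter) for m in center)

# Hypothetical class centers in (g, r, i, z); not real DES photometry.
centers = {"quasar": (20.0, 19.8, 19.7, 19.6), "star": (21.0, 20.0, 19.5, 19.2)}
train = [(fake_object(c), label) for label, c in centers.items() for _ in range(200)]
train_X, train_y = zip(*train)
test = [(fake_object(c), label) for label, c in centers.items() for _ in range(50)]

accuracy = sum(knn_predict(train_X, train_y, x) == y for x, y in test) / len(test)
print(f"toy accuracy: {accuracy:.2f}")
```

The toy classes are well separated, so the accuracy here says nothing about real performance; the point is only the mechanics of voting among neighbors in magnitude space.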

Load-bearing premise

The cross-matched training sample of 168,738 objects is representative of the full DES photometric population without significant selection biases or distribution shifts.

What would settle it

Spectroscopic follow-up on a random subset of the photometrically classified objects to verify whether the reported 0.99 precision and 0.77 recall are reproduced on objects outside the training cross-match.
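How such a follow-up could be scored is sketched below with a Wilson score interval on the confirmed fraction. The follow-up counts (500 observed, 492 confirmed) are hypothetical, and the function name is introduced here for illustration.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        raise ValueError("need at least one trial")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical follow-up: 500 catalog objects observed, 492 confirmed as quasars.
low, high = wilson_interval(492, 500)
consistent = low <= 0.99 <= high
print(f"precision interval: [{low:.3f}, {high:.3f}], consistent with 0.99: {consistent}")
```

A random subset of the classified objects constrains precision only. Checking recall would also require spectra of objects the classifier rejected, since missed quasars never enter the catalog.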

Original abstract

This paper presents a comprehensive study of quasar photometric classification and redshift estimation using machine learning techniques. We cross-matched photometric data from the Dark Energy Survey Data Release 2 (DES DR2) with spectroscopic classifications from the Sloan Digital Sky Survey Data Release 16 (SDSS DR16), yielding an initial sample of 168,738 point-like objects. Using a K-Nearest Neighbors (KNN) algorithm with PSF magnitudes in the $g$, $r$, $i$, and $z$ bands, we achieved high-precision quasar/galaxy classification against stellar contaminants, reaching a recall of 0.77 at 0.99 precision. Photometric redshifts were subsequently estimated using a hybrid machine learning approach combining a Boosted Decision Tree from ANNz and a Decision Tree Regressor from scikit-learn. The resulting catalog spans redshifts from $z \approx 0.5$ to $z > 3$, with a distinct population recovered at $z \approx 4$. A stacked outlier classifier was developed to mitigate catastrophic redshift errors. The full photometric redshift sample contains 872,372 objects and remains reliable for cosmological applications at $z \approx 4$. The cleaned catalog contains 675,683 objects and is suitable for large-scale structure studies in the range $0 < z < 3$. This robustly characterized quasar catalog provides a valuable resource for future cosmological investigations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper gives a new DES-based quasar catalog via standard ML methods, but selection biases in the SDSS training set are a real concern for the claimed performance.


The main thing to know is that this work applies off-the-shelf machine learning to create a photometric quasar catalog from the Dark Energy Survey DR2, cross-matched with SDSS for training labels. They report solid classification performance and a large final sample.

What the paper actually does is take 168,738 point-like objects from the cross-match and train a K-nearest neighbors model on g, r, i, z PSF magnitudes to classify quasars versus contaminants. They follow that with a hybrid photo-z estimator using boosted decision trees and regressors. The output is a catalog of 872,372 objects with redshifts from about 0.5 up to beyond 3, including a noted population around z=4. After cleaning, they have 675,683 objects they consider good for large-scale structure work.

This is new in the sense that it gives specific numbers and a cleaned sample tailored to DES data. The focus on recovering high-redshift quasars and providing a resource for cosmology is a practical step. The paper handles the metrics clearly in the abstract, which helps readers gauge what they are getting.

The soft spots are around generalization. The training set comes from SDSS spectroscopy, which has its own magnitude limits and targeting strategy. That could mean the objects used to train the model do not match the distribution of the full DES photometric sample. Without tests for that, like comparing distributions or using domain adaptation, the 0.77 recall at 0.99 precision might not apply to the 872k catalog, especially at higher redshifts. Minor issues include the lack of reported uncertainties on the performance numbers and limited discussion of validation procedures.

Readers who work on quasar selection for cosmological probes or large-scale structure analyses would find this useful as a data product. It is not breaking new ground in methods, but the catalog itself could be a reference point. The work shows clear thinking in combining the tools and cleaning the sample, so it is worth a serious referee's time. I recommend putting it through peer review, with reviewers asked to look closely at whether the training and target distributions align.

Referee Report

2 major / 2 minor

Summary. The paper cross-matches DES DR2 photometry with SDSS DR16 spectroscopy to obtain 168,738 point-like objects and applies a KNN classifier on PSF g, r, i, z magnitudes to separate quasars from galaxies and stars, reporting a recall of 0.77 at 0.99 precision. A hybrid ML pipeline (ANNz boosted decision tree plus scikit-learn decision tree regressor) then produces photometric redshifts, yielding a catalog of 872,372 objects asserted to be reliable for cosmology at z≈4 and a cleaned subset of 675,683 objects for large-scale structure studies in the range 0<z<3.

Significance. A large, photometrically classified quasar sample extending to z≈4 would be a useful resource for cosmological analyses if the quoted performance metrics generalize beyond the training set. The work demonstrates a practical application of standard ML tools to a wide-field survey.

major comments (2)

[Abstract] Abstract: the central claim that the 872,372-object catalog is 'reliable for cosmological applications at z≈4' rests on the untested assumption that the 168,738-object SDSS-DES cross-match is representative of the full DES photometric population; no reweighting, domain-adaptation diagnostics, or magnitude-color distribution comparisons are described to address spectroscopic selection biases that are known to affect high-z quasar recovery.
[Abstract] Abstract: the quoted performance (recall 0.77 at 0.99 precision) is given without error bars, cross-validation procedure, or sensitivity analysis to the choice of K or other hyperparameters, so it is impossible to judge whether the metric is robust or whether post-hoc tuning has occurred.
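For readers unfamiliar with the "recall at fixed precision" figure of merit, the sketch below shows one way it can be computed from classifier scores. The scores, labels, and function name are invented for illustration; the paper's own scoring and thresholding procedure is not described in the abstract.

```python
def recall_at_precision(scores, labels, target_precision=0.99):
    """Largest recall over score thresholds whose precision meets the target.

    labels: 1 for quasar, 0 for contaminant.
    Returns (recall, threshold), or (0.0, None) if no threshold qualifies.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        return 0.0, None
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best = (0.0, None)
    tp = fp = 0
    for idx, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Tied scores cannot be split by a threshold, so evaluate after the run.
        if idx + 1 < len(ranked) and ranked[idx + 1][0] == score:
            continue
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= target_precision and recall > best[0]:
            best = (recall, score)
    return best

# Toy scores (for example, the fraction of neighbors voting "quasar") and true labels.
scores = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 1, 0, 1, 1, 0, 0]
print(recall_at_precision(scores, labels, 0.99))
print(recall_at_precision(scores, labels, 0.80))
```

Because the threshold is chosen to hit the precision target, the resulting recall depends on the data used to choose it, which is why the referee's question about held-out evaluation matters.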

minor comments (2)

[Abstract] Abstract: the redshift range is described as 'z ≈ 0.5 to z > 3, with a distinct population recovered at z ≈ 4'; provide the precise redshift bounds of the final catalog and the criterion used to identify the z≈4 population.
[Abstract] Abstract: clarify whether the 'stacked outlier classifier' is applied before or after the hybrid photo-z step and how it affects the final sample sizes.
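The effect of an outlier-cleaning step on a photo-z sample is usually summarized with a few residual statistics, sketched below. The 0.15 cut on |Δz|/(1+z) is a commonly used convention and an assumption here; the abstract does not state which definition of a catastrophic error the authors adopt. The redshift values are toy numbers.

```python
import statistics

def photoz_metrics(z_phot, z_spec, outlier_cut=0.15):
    """Bias, scatter, and outlier fraction of normalized photo-z residuals."""
    dz = [(zp - zs) / (1 + zs) for zp, zs in zip(z_phot, z_spec)]
    bias = statistics.median(dz)
    # Normalized median absolute deviation, a robust scatter estimate.
    nmad = 1.4826 * statistics.median(abs(d - bias) for d in dz)
    outlier_fraction = sum(abs(d) > outlier_cut for d in dz) / len(dz)
    return {"bias": bias, "nmad": nmad, "outlier_fraction": outlier_fraction}

# Toy redshifts, invented for illustration; the fourth object is a catastrophic outlier.
z_spec = [0.5, 1.0, 1.5, 2.0, 3.0]
z_phot = [0.52, 0.98, 1.55, 2.9, 3.1]
metrics = photoz_metrics(z_phot, z_spec)
print(metrics)
```

Reporting these numbers before and after the outlier classifier is applied would show directly how the cleaning changes both sample size and redshift quality.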

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the scope and limitations of our work. We respond to each major comment below and indicate revisions to the manuscript.


Referee: [Abstract] Abstract: the central claim that the 872,372-object catalog is 'reliable for cosmological applications at z≈4' rests on the untested assumption that the 168,738-object SDSS-DES cross-match is representative of the full DES photometric population; no reweighting, domain-adaptation diagnostics, or magnitude-color distribution comparisons are described to address spectroscopic selection biases that are known to affect high-z quasar recovery.

Authors: We agree that explicit checks for representativeness are needed to support the reliability claim. The manuscript uses the SDSS-DES cross-match as the largest available spectroscopic anchor for DES DR2, but we will revise the abstract and add a new subsection in the methods describing magnitude and color distribution comparisons between the training sample and the full DES point-like photometric population. We will also outline a magnitude-based reweighting scheme and note its limitations for high-z selection biases.
revision: yes
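One simple form a magnitude-based reweighting could take is a histogram ratio, sketched below. This is an assumed scheme for illustration, not the one the authors propose: the bin width, function name, and magnitudes are hypothetical.

```python
from collections import Counter

def magnitude_weights(train_mags, target_mags, bin_width=0.5):
    """Weight each training object by (target fraction) / (training fraction)
    in its magnitude bin, so the weighted training histogram follows the target."""
    def to_bin(mag):
        return int(mag // bin_width)

    train_counts = Counter(to_bin(m) for m in train_mags)
    target_counts = Counter(to_bin(m) for m in target_mags)
    n_train, n_target = len(train_mags), len(target_mags)
    weights = []
    for m in train_mags:
        b = to_bin(m)
        train_frac = train_counts[b] / n_train
        target_frac = target_counts[b] / n_target  # zero if the target has no such objects
        weights.append(target_frac / train_frac)
    return weights

# Toy magnitudes: the target population extends fainter than the training set.
train_mags = [18.2, 18.4, 18.7, 19.1, 19.3, 20.2]
target_mags = [18.3, 19.2, 19.4, 20.1, 20.3, 20.4, 20.6, 20.8]
weights = magnitude_weights(train_mags, target_mags)
covered = sum(weights) / len(weights)  # fraction of the target inside trained bins
print(weights, covered)
```

In this toy case the mean weight is 0.75, meaning a quarter of the target population sits in a faint bin with no training objects at all. Reweighting cannot repair that gap, which is the limitation that matters most for faint, high-redshift sources.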
Referee: [Abstract] Abstract: the quoted performance (recall 0.77 at 0.99 precision) is given without error bars, cross-validation procedure, or sensitivity analysis to the choice of K or other hyperparameters, so it is impossible to judge whether the metric is robust or whether post-hoc tuning has occurred.

Authors: The metrics were computed on a held-out test set after 5-fold cross-validation for hyperparameter tuning on the training portion. We will revise the abstract and methods to report bootstrap error bars on the recall and precision, explicitly describe the cross-validation folds, and include a sensitivity plot showing performance stability for K between 3 and 15. This will confirm that the reported values reflect validated choices rather than post-hoc adjustment.
revision: yes
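Bootstrap error bars of the kind promised here can be sketched as a percentile bootstrap over the held-out objects. The confusion counts below are synthetic, and the number of resamples and the confidence level are assumed values.

```python
import math
import random

def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = quasar, 0 = contaminant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else math.nan
    recall = tp / (tp + fn) if tp + fn else math.nan
    return precision, recall

def bootstrap_intervals(y_true, y_pred, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap intervals, resampling test objects with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    draws = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        draws.append(precision_recall([y_true[i] for i in idx],
                                      [y_pred[i] for i in idx]))
    intervals = {}
    for name, col in (("precision", 0), ("recall", 1)):
        vals = sorted(d[col] for d in draws if not math.isnan(d[col]))
        lower = vals[int(alpha / 2 * len(vals))]
        upper = vals[int((1 - alpha / 2) * len(vals)) - 1]
        intervals[name] = (lower, upper)
    return intervals

# Synthetic held-out set: 80 true positives, 2 false positives,
# 20 false negatives, 398 true negatives.
y_true = [1] * 80 + [0] * 2 + [1] * 20 + [0] * 398
y_pred = [1] * 80 + [1] * 2 + [0] * 20 + [0] * 398
point = precision_recall(y_true, y_pred)
intervals = bootstrap_intervals(y_true, y_pred)
print(point, intervals)
```

With only a handful of false positives, the precision interval is strongly asymmetric and often touches 1.0, so a symmetric "plus or minus" error bar would misrepresent it.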

Circularity Check

0 steps flagged

No significant circularity in empirical ML classification and photo-z pipeline


The paper applies off-the-shelf KNN and hybrid ML (ANNz BDT + scikit-learn regressor) to a cross-matched DES-SDSS training set of 168,738 objects, then reports recall/precision and produces a catalog of 872,372 objects. All performance numbers are computed directly against external spectroscopic labels; no functional form is fitted and then re-used as a 'prediction', no self-citation supplies a load-bearing uniqueness theorem, and no ansatz or renaming occurs. The representativeness of the training sample is an empirical assumption whose validity can be tested externally, but it does not make the reported metrics tautological by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central results rest on the representativeness of the spectroscopic training set and on standard assumptions of supervised ML (i.i.d. training and test distributions, appropriate feature choice). No new physical axioms or invented entities are introduced.

free parameters (2)

K in KNN
Hyperparameter controlling neighborhood size for classification; value not stated in abstract but required for the reported precision-recall numbers.
hyperparameters of boosted decision tree and regressor
Learning rate, number of estimators, and tree depth chosen to optimize the hybrid photo-z model.

axioms (1)

domain assumption: The cross-matched DES-SDSS sample is free of significant selection bias relative to the full photometric population.
Invoked when training on 168k objects and applying to the larger photometric set.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear


Using a K-Nearest Neighbors (KNN) algorithm with PSF magnitudes in the g, r, i, and z bands, we achieved high-precision quasar/galaxy classification against stellar contaminants, reaching a recall of 0.77 at 0.99 precision.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear


The full photometric redshift sample contains 872,372 objects and remains reliable for cosmological applications at z≈4.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.