Photometric classification of quasars from DES and photo-z estimation with Machine Learning
Pith reviewed 2026-05-20 00:20 UTC · model grok-4.3
The pith
KNN on DES photometry classifies quasars at 0.99 precision and builds an 872k-object photo-z catalog
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cross-matching DES DR2 with SDSS DR16 produces a training set of 168,738 objects on which a KNN classifier using four-band PSF magnitudes separates quasars from contaminants at 0.99 precision with 0.77 recall. A hybrid ML approach combining boosted decision trees and a decision tree regressor then estimates photometric redshifts across 872,372 photometric objects, with 675,683 cleaned objects reliable for cosmological applications in the range 0 < z < 3 and the full set useful at z ≈ 4.
What carries the argument
K-Nearest Neighbors classifier on PSF magnitudes in the g, r, i, z bands for quasar selection, followed by a hybrid boosted decision tree plus decision tree regressor pipeline for photometric redshift estimation
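The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the magnitudes are synthetic stand-ins for DES PSF photometry, `n_neighbors=15` is an arbitrary choice, and `threshold_for_precision` is a hypothetical helper that picks the score cut with the highest recall among cuts reaching a target precision.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def threshold_for_precision(y_true, scores, target):
    """Return (threshold, precision, recall) with the highest recall among
    operating points whose precision is at least `target`, or None."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The final (precision=1, recall=0) point has no threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    ok = precision >= target
    if not ok.any():
        return None
    best = int(np.argmax(np.where(ok, recall, -1.0)))
    return float(thresholds[best]), float(precision[best]), float(recall[best])


# Synthetic g, r, i, z magnitudes (assumption: NOT real DES data).
rng = np.random.default_rng(0)
n = 4000
is_quasar = rng.integers(0, 2, size=n)
brightness = rng.normal(20.0, 1.0, size=(n, 1))
colour = np.where(is_quasar[:, None] == 1,
                  [0.3, 0.2, 0.1, 0.0],   # toy quasar colour offsets
                  [1.2, 0.6, 0.2, 0.0])   # toy contaminant colour offsets
mags = brightness + colour + rng.normal(0.0, 0.15, size=(n, 4))

X_train, X_test, y_train, y_test = train_test_split(
    mags, is_quasar, test_size=0.3, random_state=0, stratify=is_quasar
)

knn = KNeighborsClassifier(n_neighbors=15)  # K is a free parameter
knn.fit(X_train, y_train)
scores = knn.predict_proba(X_test)[:, 1]  # fraction of quasar neighbours

print(threshold_for_precision(y_test, scores, target=0.99))
```

Fixing the precision target first and reading off the recall mirrors how the "0.77 recall at 0.99 precision" figure is quoted, though the paper's actual thresholding procedure is not described in the text above.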
Load-bearing premise
The cross-matched training sample of 168,738 objects is representative of the full DES photometric population without significant selection biases or distribution shifts.
What would settle it
Spectroscopic follow-up on a random subset of the photometrically classified objects to verify whether the reported 0.99 precision and 0.77 recall are reproduced on objects outside the training cross-match.
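As a rough guide to how large such a follow-up would need to be, the uncertainty on a measured precision can be sketched with a Wilson score interval. The counts below (495 confirmed quasars out of 500 followed-up candidates) are hypothetical and chosen only for illustration.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half


# Hypothetical follow-up: 500 photometric candidates, 495 confirmed quasars.
low, high = wilson_interval(495, 500)
print(f"precision interval: [{low:.3f}, {high:.3f}]")
```

The Wilson form is used here because the simpler normal approximation behaves poorly for proportions close to 1, which is the regime of interest.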
Original abstract
This paper presents a comprehensive study of quasar photometric classification and redshift estimation using machine learning techniques. We cross-matched photometric data from the Dark Energy Survey Data Release 2 (DES DR2) with spectroscopic classifications from the Sloan Digital Sky Survey Data Release 16 (SDSS DR16), yielding an initial sample of 168,738 point-like objects. Using a K-Nearest Neighbors (KNN) algorithm with PSF magnitudes in the $g$, $r$, $i$, and $z$ bands, we achieved high-precision quasar/galaxy classification against stellar contaminants, reaching a recall of 0.77 at 0.99 precision. Photometric redshifts were subsequently estimated using a hybrid machine learning approach combining a Boosted Decision Tree from ANNz and a Decision Tree Regressor from scikit-learn. The resulting catalog spans redshifts from $z \approx 0.5$ to $z > 3$, with a distinct population recovered at $z \approx 4$. A stacked outlier classifier was developed to mitigate catastrophic redshift errors. The full photometric redshift sample contains 872,372 objects and remains reliable for cosmological applications at $z \approx 4$. The cleaned catalog contains 675,683 objects and is suitable for large-scale structure studies in the range $0 < z < 3$. This robustly characterized quasar catalog provides a valuable resource for future cosmological investigations.
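For readers unfamiliar with how catastrophic photo-z errors are usually quantified, here is a small sketch of common summary statistics. It assumes the widely used normalized residual Δz/(1+z) and an outlier cut of 0.15; the abstract does not state the paper's own definitions, so both are assumptions, and `photoz_summary` is a hypothetical helper.

```python
import numpy as np


def photoz_summary(z_spec, z_phot, outlier_cut=0.15):
    """Bias, NMAD scatter and outlier fraction of normalized residuals."""
    z_spec = np.asarray(z_spec, dtype=float)
    z_phot = np.asarray(z_phot, dtype=float)
    dz = (z_phot - z_spec) / (1.0 + z_spec)
    bias = float(np.median(dz))
    # Normalized median absolute deviation: a robust scatter estimate.
    nmad = float(1.4826 * np.median(np.abs(dz - bias)))
    outlier_fraction = float(np.mean(np.abs(dz) > outlier_cut))
    return {"bias": bias, "nmad": nmad, "outlier_fraction": outlier_fraction}


# Toy values only: three good estimates and one catastrophic failure.
print(photoz_summary([1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 2.0]))
```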
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper cross-matches DES DR2 photometry with SDSS DR16 spectroscopy to obtain 168,738 point-like objects and applies a KNN classifier on PSF g, r, i, z magnitudes to separate quasars from galaxies and stars, reporting a recall of 0.77 at 0.99 precision. A hybrid ML pipeline (ANNz boosted decision tree plus scikit-learn decision tree regressor) then produces photometric redshifts, yielding a catalog of 872,372 objects asserted to be reliable for cosmology at z≈4 and a cleaned subset of 675,683 objects for large-scale structure studies in the range 0<z<3.
Significance. A large, photometrically classified quasar sample extending to z≈4 would be a useful resource for cosmological analyses if the quoted performance metrics generalize beyond the training set. The work demonstrates a practical application of standard ML tools to a wide-field survey.
major comments (2)
- [Abstract] The central claim that the 872,372-object catalog is 'reliable for cosmological applications at z≈4' rests on the untested assumption that the 168,738-object SDSS-DES cross-match is representative of the full DES photometric population; no reweighting, domain-adaptation diagnostics, or magnitude-color distribution comparisons are described to address spectroscopic selection biases that are known to affect high-z quasar recovery.
- [Abstract] The quoted performance (recall 0.77 at 0.99 precision) is given without error bars, cross-validation procedure, or sensitivity analysis to the choice of K or other hyperparameters, so it is impossible to judge whether the metric is robust or whether post-hoc tuning has occurred.
minor comments (2)
- [Abstract] The redshift range is described as 'z ≈ 0.5 to z > 3, with a distinct population recovered at z ≈ 4'; provide the precise redshift bounds of the final catalog and the criterion used to identify the z≈4 population.
- [Abstract] Clarify whether the 'stacked outlier classifier' is applied before or after the hybrid photo-z step and how it affects the final sample sizes.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the scope and limitations of our work. We respond to each major comment below and indicate revisions to the manuscript.
- Referee: [Abstract] The central claim that the 872,372-object catalog is 'reliable for cosmological applications at z≈4' rests on the untested assumption that the 168,738-object SDSS-DES cross-match is representative of the full DES photometric population; no reweighting, domain-adaptation diagnostics, or magnitude-color distribution comparisons are described to address spectroscopic selection biases that are known to affect high-z quasar recovery.
Authors: We agree that explicit checks for representativeness are needed to support the reliability claim. The manuscript uses the SDSS-DES cross-match as the largest available spectroscopic anchor for DES DR2, but we will revise the abstract and add a new subsection in the methods describing magnitude and color distribution comparisons between the training sample and the full DES point-like photometric population. We will also outline a magnitude-based reweighting scheme and note its limitations for high-z selection biases.
Revision: yes
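One simple form a magnitude-based reweighting could take is a histogram ratio between the target and training magnitude distributions. The sketch below is an assumption about what such a scheme might look like, not the authors' method; the magnitudes are synthetic and `histogram_weights` is a hypothetical helper.

```python
import numpy as np


def histogram_weights(train_mag, target_mag, edges):
    """Per-object weights that make the binned training magnitude
    distribution match the target one; normalized to mean 1.
    Objects outside `edges` are assigned to the nearest end bin."""
    train_mag = np.asarray(train_mag, dtype=float)
    target_mag = np.asarray(target_mag, dtype=float)
    train_counts, edges = np.histogram(train_mag, bins=edges)
    target_counts, _ = np.histogram(target_mag, bins=edges)
    train_frac = train_counts / train_counts.sum()
    target_frac = target_counts / target_counts.sum()
    # Bins with no training objects get weight 0: reweighting cannot
    # recover regions the spectroscopic sample never covers.
    ratio = np.divide(target_frac, train_frac,
                      out=np.zeros_like(target_frac), where=train_frac > 0)
    idx = np.clip(np.digitize(train_mag, edges[1:-1]), 0, len(ratio) - 1)
    weights = ratio[idx]
    return weights / weights.mean()


# Synthetic example: training sample brighter than the target population.
rng = np.random.default_rng(0)
train_mag = rng.normal(19.5, 0.8, size=5000)
target_mag = rng.normal(20.5, 1.0, size=20000)
edges = np.linspace(16.0, 25.0, 37)
w = histogram_weights(train_mag, target_mag, edges)
print(np.average(train_mag, weights=w), target_mag.mean())
```

Such weights could then be passed as `sample_weight` to scikit-learn metric functions such as `precision_score`. The zero-weight bins illustrate the limitation the authors mention: faint or high-z regions absent from the spectroscopic sample stay unconstrained.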
- Referee: [Abstract] The quoted performance (recall 0.77 at 0.99 precision) is given without error bars, cross-validation procedure, or sensitivity analysis to the choice of K or other hyperparameters, so it is impossible to judge whether the metric is robust or whether post-hoc tuning has occurred.
Authors: The metrics were computed on a held-out test set after 5-fold cross-validation for hyperparameter tuning on the training portion. We will revise the abstract and methods to report bootstrap error bars on the recall and precision, explicitly describe the cross-validation folds, and include a sensitivity plot showing performance stability for K between 3 and 15. This will confirm that the reported values reflect validated choices rather than post-hoc adjustment.
Revision: yes
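The validation protocol described in this response could look roughly like the sketch below: 5-fold cross-validation over K on the development portion, then bootstrap error bars on a held-out test set. The data come from scikit-learn's `make_classification` (not DES), the `f1` scoring choice is an assumption, and `bootstrap_metric` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier


def bootstrap_metric(y_true, y_pred, metric, n_boot=200, seed=0):
    """Mean and standard deviation of `metric` over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    values = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample with replacement
        values.append(metric(y_true[idx], y_pred[idx]))
    values = np.array(values)
    return float(values.mean()), float(values.std(ddof=1))


# Synthetic four-feature data standing in for g, r, i, z magnitudes.
X, y = make_classification(n_samples=2000, n_features=4, n_informative=4,
                           n_redundant=0, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# 5-fold cross-validation over odd K from 3 to 15 on the development set.
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": list(range(3, 16, 2))},
                      cv=5, scoring="f1")
search.fit(X_dev, y_dev)

# Sensitivity to K: mean cross-validated score per candidate value.
for k, score in zip(search.cv_results_["param_n_neighbors"],
                    search.cv_results_["mean_test_score"]):
    print(f"K={k}: mean CV f1={score:.3f}")

# Error bars on the held-out test set, which played no role in tuning.
y_pred = search.predict(X_test)
for name, metric in [("precision", precision_score), ("recall", recall_score)]:
    mean, std = bootstrap_metric(y_test, y_pred, metric)
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```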
Circularity Check
No significant circularity in empirical ML classification and photo-z pipeline
The paper applies off-the-shelf KNN and hybrid ML (ANNz BDT + scikit-learn regressor) to a cross-matched DES-SDSS training set of 168,738 objects, then reports recall/precision and produces a catalog of 872,372 objects. All performance numbers are computed directly against external spectroscopic labels; no functional form is fitted and then re-used as a 'prediction', no self-citation supplies a load-bearing uniqueness theorem, and no ansatz or renaming occurs. The representativeness of the training sample is an empirical assumption whose validity can be tested externally, but it does not make the reported metrics tautological by construction. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- K in KNN
- hyperparameters of boosted decision tree and regressor
axioms (1)
- domain assumption: The cross-matched DES-SDSS sample is free of significant selection bias relative to the full photometric population.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Using a K-Nearest Neighbors (KNN) algorithm with PSF magnitudes in the g, r, i, and z bands, we achieved high-precision quasar/galaxy classification against stellar contaminants, reaching a recall of 0.77 at 0.99 precision.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: The full photometric redshift sample contains 872,372 objects and remains reliable for cosmological applications at z≈4.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.