
arxiv: 2605.18218 · v2 · pith:W3V6TSY4new · submitted 2026-05-18 · 🌌 astro-ph.IM · astro-ph.CO

Photometric classification of quasars from DES and photo-z estimation with Machine Learning

Pith reviewed 2026-05-20 00:20 UTC · model grok-4.3

keywords quasar classification · photometric redshifts · machine learning · Dark Energy Survey · KNN · cosmology · large-scale structure

The pith

KNN on DES photometry classifies quasars at 0.99 precision and builds an 872k-object photo-z catalog

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper cross-matches DES DR2 photometry with SDSS DR16 spectroscopy to assemble a training set of 168,738 point-like objects. A K-nearest neighbors algorithm applied to PSF magnitudes in the g, r, i, and z bands separates quasars from stars and galaxies at 0.99 precision and 0.77 recall. A hybrid machine-learning pipeline that combines boosted decision trees with a decision-tree regressor then estimates photometric redshifts, yielding a sample of 872,372 objects. After cleaning, 675,683 objects remain suitable for large-scale structure work at redshifts below 3, while the full catalog stays reliable for cosmological use near redshift 4.

Core claim

Cross-matching DES DR2 with SDSS DR16 produces a training set of 168,738 objects on which a KNN classifier using four-band PSF magnitudes separates quasars from contaminants at 0.99 precision with 0.77 recall. A hybrid ML approach combining boosted decision trees and a decision tree regressor then estimates photometric redshifts across 872,372 photometric objects, with 675,683 cleaned objects reliable for cosmological applications in the range 0 < z < 3 and the full set useful at z ≈ 4.

What carries the argument

K-Nearest Neighbors classifier on PSF magnitudes in the g, r, i, z bands for quasar selection, followed by a hybrid boosted decision tree plus decision tree regressor pipeline for photometric redshift estimation
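
As a rough illustration of the classification step, the sketch below implements a plain majority-vote KNN over four-band magnitude vectors. It is not the paper's implementation: the function name `knn_predict`, the Euclidean distance metric, the value `k=3`, and every magnitude in the toy arrays are assumptions made for this example.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points nearest to `query`."""
    # Euclidean distance in (g, r, i, z) magnitude space; the metric is an assumption.
    ranked = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy (g, r, i, z) PSF magnitudes, invented for illustration only.
train_X = [
    (20.1, 19.9, 19.8, 19.7),  # labelled QSO: nearly flat colours
    (19.6, 19.5, 19.3, 19.3),
    (21.0, 20.8, 20.8, 20.6),
    (20.5, 19.3, 18.8, 18.5),  # labelled STAR: redder colours
    (21.2, 19.9, 19.2, 18.9),
    (19.9, 18.8, 18.3, 18.1),
]
train_y = ["QSO", "QSO", "QSO", "STAR", "STAR", "STAR"]

label_a = knn_predict(train_X, train_y, (20.3, 20.1, 20.0, 19.9))
label_b = knn_predict(train_X, train_y, (20.8, 19.6, 19.0, 18.7))
print(label_a, label_b)  # QSO STAR
```

Because the vote is taken on raw magnitudes, the distance mixes brightness and colour; whether the paper rescales or transforms the features is not stated in the abstract.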

Load-bearing premise

The cross-matched training sample of 168,738 objects is representative of the full DES photometric population without significant selection biases or distribution shifts.

What would settle it

Spectroscopic follow-up on a random subset of the photometrically classified objects to verify whether the reported 0.99 precision and 0.77 recall are reproduced on objects outside the training cross-match.
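
The check described above reduces to counting true positives, false positives, and false negatives on the follow-up sample. The sketch below shows that arithmetic; the function name `precision_recall` and the ten-object sample are hypothetical and carry no information about the actual catalog.

```python
def precision_recall(predicted_qso, is_qso):
    """Return (precision, recall) for boolean predictions against boolean truth."""
    pairs = list(zip(predicted_qso, is_qso))
    tp = sum(1 for p, t in pairs if p and t)        # predicted quasar, confirmed
    fp = sum(1 for p, t in pairs if p and not t)    # predicted quasar, not confirmed
    fn = sum(1 for p, t in pairs if not p and t)    # missed quasar
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


# Hypothetical follow-up of ten objects (values invented for illustration).
predicted_qso = [True, True, True, True, False, False, False, False, False, False]
is_qso = [True, True, True, False, True, False, False, False, False, False]

precision, recall = precision_recall(predicted_qso, is_qso)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.75
```

Note that recall can only be measured if the follow-up also targets objects the classifier rejected; spectra of accepted candidates alone constrain precision only.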

original abstract

This paper presents a comprehensive study of quasar photometric classification and redshift estimation using machine learning techniques. We cross-matched photometric data from the Dark Energy Survey Data Release 2 (DES DR2) with spectroscopic classifications from the Sloan Digital Sky Survey Data Release 16 (SDSS DR16), yielding an initial sample of 168,738 point-like objects. Using a K-Nearest Neighbors (KNN) algorithm with PSF magnitudes in the $g$, $r$, $i$, and $z$ bands, we achieved high-precision quasar/galaxy classification against stellar contaminants, reaching a recall of 0.77 at 0.99 precision. Photometric redshifts were subsequently estimated using a hybrid machine learning approach combining a Boosted Decision Tree from ANNz and a Decision Tree Regressor from scikit-learn. The resulting catalog spans redshifts from $z \approx 0.5$ to $z > 3$, with a distinct population recovered at $z \approx 4$. A stacked outlier classifier was developed to mitigate catastrophic redshift errors. The full photometric redshift sample contains 872,372 objects and remains reliable for cosmological applications at $z \approx 4$. The cleaned catalog contains 675,683 objects and is suitable for large-scale structure studies in the range $0 < z < 3$. This robustly characterized quasar catalog provides a valuable resource for future cosmological investigations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper cross-matches DES DR2 photometry with SDSS DR16 spectroscopy to obtain 168,738 point-like objects and applies a KNN classifier on PSF g, r, i, z magnitudes to separate quasars from galaxies and stars, reporting a recall of 0.77 at 0.99 precision. A hybrid ML pipeline (ANNz boosted decision tree plus scikit-learn decision tree regressor) then produces photometric redshifts, yielding a catalog of 872,372 objects asserted to be reliable for cosmology at z≈4 and a cleaned subset of 675,683 objects for large-scale structure studies in the range 0<z<3.

Significance. A large, photometrically classified quasar sample extending to z≈4 would be a useful resource for cosmological analyses if the quoted performance metrics generalize beyond the training set. The work demonstrates a practical application of standard ML tools to a wide-field survey.

major comments (2)
  1. [Abstract] The central claim that the 872,372-object catalog is 'reliable for cosmological applications at z≈4' rests on the untested assumption that the 168,738-object SDSS-DES cross-match is representative of the full DES photometric population; no reweighting, domain-adaptation diagnostics, or magnitude-color distribution comparisons are described to address spectroscopic selection biases that are known to affect high-z quasar recovery.
  2. [Abstract] The quoted performance (recall 0.77 at 0.99 precision) is given without error bars, cross-validation procedure, or sensitivity analysis to the choice of K or other hyperparameters, so it is impossible to judge whether the metric is robust or whether post-hoc tuning has occurred.
minor comments (2)
  1. [Abstract] The redshift range is described as 'z ≈ 0.5 to z > 3, with a distinct population recovered at z ≈ 4'; provide the precise redshift bounds of the final catalog and the criterion used to identify the z≈4 population.
  2. [Abstract] Clarify whether the 'stacked outlier classifier' is applied before or after the hybrid photo-z step and how it affects the final sample sizes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the scope and limitations of our work. We respond to each major comment below and indicate revisions to the manuscript.

point-by-point responses
  1. Referee: [Abstract] The central claim that the 872,372-object catalog is 'reliable for cosmological applications at z≈4' rests on the untested assumption that the 168,738-object SDSS-DES cross-match is representative of the full DES photometric population; no reweighting, domain-adaptation diagnostics, or magnitude-color distribution comparisons are described to address spectroscopic selection biases that are known to affect high-z quasar recovery.

    Authors: We agree that explicit checks for representativeness are needed to support the reliability claim. The manuscript uses the SDSS-DES cross-match as the largest available spectroscopic anchor for DES DR2, but we will revise the abstract and add a new subsection in the methods describing magnitude and color distribution comparisons between the training sample and the full DES point-like photometric population. We will also outline a magnitude-based reweighting scheme and note its limitations for high-z selection biases. revision: yes

  2. Referee: [Abstract] The quoted performance (recall 0.77 at 0.99 precision) is given without error bars, cross-validation procedure, or sensitivity analysis to the choice of K or other hyperparameters, so it is impossible to judge whether the metric is robust or whether post-hoc tuning has occurred.

    Authors: The metrics were computed on a held-out test set after 5-fold cross-validation for hyperparameter tuning on the training portion. We will revise the abstract and methods to report bootstrap error bars on the recall and precision, explicitly describe the cross-validation folds, and include a sensitivity plot showing performance stability for K between 3 and 15. This will confirm that the reported values reflect validated choices rather than post-hoc adjustment. revision: yes
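
The bootstrap error bars mentioned in the second response could be obtained along the following lines. This is a minimal sketch under stated assumptions, not the authors' procedure: the function names, the 16th to 84th percentile interval, the 500 resamples, and the toy confusion counts (chosen only so the point estimates land near the quoted values) are all invented for illustration.

```python
import random


def precision_recall(pred, truth):
    """Return (precision, recall); raises ZeroDivisionError on empty classes."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    return tp / (tp + fp), tp / (tp + fn)


def bootstrap_intervals(pred, truth, n_resamples=500, seed=0):
    """Percentile (16th to 84th) intervals for precision and recall."""
    rng = random.Random(seed)
    n = len(pred)
    samples = []
    for _ in range(n_resamples):
        # Resample test-set objects with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            samples.append(
                precision_recall([pred[i] for i in idx], [truth[i] for i in idx])
            )
        except ZeroDivisionError:
            continue  # resample had no predicted or no true quasars

    def interval(values):
        values = sorted(values)
        last = len(values) - 1
        return values[int(0.16 * last)], values[int(0.84 * last)]

    return interval([p for p, _ in samples]), interval([r for _, r in samples])


# Toy held-out set: 77 TP, 1 FP, 23 FN, 99 TN (invented counts).
pred = [True] * 77 + [True] * 1 + [False] * 23 + [False] * 99
truth = [True] * 77 + [False] * 1 + [True] * 23 + [False] * 99

point_precision, point_recall = precision_recall(pred, truth)
(prec_lo, prec_hi), (rec_lo, rec_hi) = bootstrap_intervals(pred, truth)
print(f"precision {point_precision:.3f} [{prec_lo:.3f}, {prec_hi:.3f}]")
print(f"recall    {point_recall:.3f} [{rec_lo:.3f}, {rec_hi:.3f}]")
```

At precision this close to 1 the interval is driven by a handful of false positives, so it is typically asymmetric and sensitive to test-set size.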

Circularity Check

0 steps flagged

No significant circularity in empirical ML classification and photo-z pipeline

full rationale

The paper applies off-the-shelf KNN and hybrid ML (ANNz BDT + scikit-learn regressor) to a cross-matched DES-SDSS training set of 168,738 objects, then reports recall/precision and produces a catalog of 872,372 objects. All performance numbers are computed directly against external spectroscopic labels; no functional form is fitted and then re-used as a 'prediction', no self-citation supplies a load-bearing uniqueness theorem, and no ansatz or renaming occurs. The representativeness of the training sample is an empirical assumption whose validity can be tested externally, but it does not make the reported metrics tautological by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central results rest on the representativeness of the spectroscopic training set and on standard assumptions of supervised ML (i.i.d. training and test distributions, appropriate feature choice). No new physical axioms or invented entities are introduced.

free parameters (2)
  • K in KNN
    Hyperparameter controlling neighborhood size for classification; value not stated in abstract but required for the reported precision-recall numbers.
  • hyperparameters of boosted decision tree and regressor
    Learning rate, number of estimators, and tree depth chosen to optimize the hybrid photo-z model.
axioms (1)
  • domain assumption: The cross-matched DES-SDSS sample is free of significant selection bias relative to the full photometric population.
    Invoked when training on 168k objects and applying to the larger photometric set.
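
One common way to probe this assumption is to reweight the training objects so that their magnitude distribution matches the target population. The sketch below uses a simple histogram ratio in a single band; the function name `magnitude_weights`, the bin width, and the toy magnitudes are assumptions for this example and do not describe anything in the paper.

```python
from collections import Counter


def magnitude_weights(train_mags, target_mags, bin_width=1.0):
    """Weight each training object by (target fraction) / (training fraction) in its bin."""

    def bin_of(mag):
        return int(mag // bin_width)

    train_counts = Counter(bin_of(m) for m in train_mags)
    target_counts = Counter(bin_of(m) for m in target_mags)
    n_train, n_target = len(train_mags), len(target_mags)

    weights = []
    for m in train_mags:
        b = bin_of(m)
        train_frac = train_counts[b] / n_train
        target_frac = target_counts[b] / n_target  # 0 if the target has no objects here
        weights.append(target_frac / train_frac)
    return weights


# Toy single-band magnitudes (invented): the training set is skewed bright.
train_mags = [19.1, 19.2, 19.3, 20.1]
target_mags = [19.1, 20.1, 20.2, 20.3]

weights = magnitude_weights(train_mags, target_mags)
print([round(w, 3) for w in weights])  # [0.333, 0.333, 0.333, 3.0]
```

Reweighting can only rebalance regions the training set already covers: a target bin with no training objects receives no weight at all, which is the situation most likely to matter for faint or high-redshift sources.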

pith-pipeline@v0.9.0 · 5800 in / 1352 out tokens · 33579 ms · 2026-05-20T00:20:48.187613+00:00 · methodology

