Catalog-based detection of unrecognized blends in deep optical ground based imaging

Anja von der Linden; Prakruth Adari; Shuang Liang; The LSST Dark Energy Science Collaboration

arxiv: 2503.16680 · v2 · pith:GXJYNTLHnew · submitted 2025-03-20 · 🌌 astro-ph.CO

Pith reviewed 2026-05-25 08:17 UTC · model grok-4.3

classification 🌌 astro-ph.CO

keywords unrecognized blends · machine learning · catalog photometry · COSMOS · ground-based imaging · LSST · photometric redshifts · sample purity

The pith

Machine learning on catalog colors, magnitudes and sizes can flag 30 to 80 percent of unrecognized blends while rejecting 10 to 50 percent of all detected galaxies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that machine learning algorithms applied only to catalog-level photometry and size can identify a substantial share of unrecognized blends, where multiple objects appear as a single detection in ground-based images. This matters because unrecognized blends are expected to make up 15 to 30 percent of detections in deep ground-based imaging and contaminate the galaxy samples used for cosmology. Using HST data as the truth label for the ground-based COSMOS catalog, the authors find that 17 percent of objects are unrecognized blends, and they compare Self Organizing Maps, Random Forests, k-Nearest Neighbors and anomaly detection on that sample. The algorithms recover 30 to 80 percent of those blends while discarding 10 to 50 percent of the full sample, with comparable results when restricted to optical bands plus size. The same methods also remove some photo-z outliers, offering a route to cleaner samples for surveys such as LSST.

Core claim

Catalog-based machine learning algorithms can identify approximately 30% to 80% of unrecognized blends in the ground-based COSMOS catalog while rejecting 10% to 50% of detected galaxies, using only colors, magnitudes, and size information, with HST data serving as the truth label. Some blends remain hard to flag from catalog data alone. The approach improves sample purity and yields similar performance with optical bands only.
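To make the purity gain concrete, the following sketch computes the blend fraction left in the sample after a cut, given the 17% blend fraction quoted above. The function name is hypothetical, and pairing the low ends (30% recall at 10% rejection) and the high ends (80% at 50%) of the quoted ranges is an assumption made for illustration, not an operating point reported by the paper.

```python
def remaining_blend_fraction(blend_fraction, recall, cost):
    """Blend fraction of the sample that survives the cut.

    blend_fraction: fraction of detections that are unrecognized blends
    recall: fraction of blends that the classifier removes
    cost: fraction of all detections that the classifier removes
    """
    if not 0.0 <= cost < 1.0:
        raise ValueError("cost must be in [0, 1)")
    blends_left = blend_fraction * (1.0 - recall)  # blends still in the sample
    sample_left = 1.0 - cost                       # all objects still in the sample
    return blends_left / sample_left


# Assumed pairing of the quoted range endpoints (illustrative only).
low_end = remaining_blend_fraction(0.17, recall=0.30, cost=0.10)
high_end = remaining_blend_fraction(0.17, recall=0.80, cost=0.50)
print(f"{low_end:.3f} {high_end:.3f}")  # prints "0.132 0.068"
```

Under these assumed pairings the blend fraction drops from 17% to roughly 13% or 7%, which is the sense in which the cut trades sample size for purity.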

What carries the argument

Machine learning classifiers (Self Organizing Map, Random Forest, k-Nearest Neighbors, Anomaly Detection) trained on 9-band photometry plus i-band flux_radius to separate unrecognized blends from single objects.
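A minimal sketch of how a 10-feature input (8 colors, 1 magnitude, 1 size) could be assembled from 9-band photometry and a flux radius. The band keys, the choice of adjacent-band colors, and the use of the i-band as the reference magnitude are assumptions for illustration; the page does not state the exact feature definitions.

```python
# Band order follows the uBVri+z++YJH listing; the key names are hypothetical.
BANDS = ["u", "B", "V", "r", "ip", "zpp", "Y", "J", "H"]


def build_features(mags, flux_radius, ref_band="ip"):
    """Return 8 adjacent-band colors, 1 reference magnitude and 1 size."""
    missing = [band for band in BANDS if band not in mags]
    if missing:
        raise KeyError(f"missing bands: {missing}")
    # Adjacent-band colors (u-B, B-V, ...): an assumed convention.
    colors = [mags[a] - mags[b] for a, b in zip(BANDS[:-1], BANDS[1:])]
    return colors + [mags[ref_band], flux_radius]


example = {"u": 25.1, "B": 24.8, "V": 24.3, "r": 23.9, "ip": 23.5,
           "zpp": 23.3, "Y": 23.2, "J": 23.0, "H": 22.8}
features = build_features(example, flux_radius=3.2)
print(len(features))  # prints 10
```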

If this is right

The methods can be used to improve sample purity for cosmological analyses.
Performance remains similar when only optical bands and size information are available.
Algorithms targeting color outliers remove photo-z outliers more effectively than blend-targeted algorithms.
The approach offers a cleaner galaxy sample with lower blending rates for surveys such as LSST, potentially improving the accuracy of cosmological parameter constraints at a moderate cost in precision.
Catalog-level information alone suffices to flag a useful fraction of blends without high-resolution imaging.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If domain shift proves small, the same trained models could be applied directly to LSST catalog data without new high-resolution labels.
The technique might be stacked with existing deblending or morphological cuts to raise overall blend recovery above the reported 30-80 percent range.
Retraining on survey-specific photometry could mitigate any generalization loss when moving beyond the COSMOS field.

Load-bearing premise

The HST-based labels accurately identify unrecognized blends in the ground-based COSMOS catalog, and the trained models generalize to other fields and surveys without significant domain shift.

What would settle it

A test set from an independent ground-based field with new HST or equivalent high-resolution truth labels that yields blend detection rates outside the 30-80 percent range at comparable rejection fractions would falsify the performance numbers.

Figures

Figures reproduced from arXiv: 2503.16680 by Anja von der Linden, Prakruth Adari, Shuang Liang, The LSST Dark Energy Science Collaboration.

**Figure 1.** Illustration of different matching scenarios. Case 1: The ground detection matches a space detection with no other sources in the vicinity (1-1 match). This is a “pure” source. Case 2: The ground detection matches two space sources (1-2 match). This is an unrecognized blend. Case 3: Recognized blends, where two ground detections are close by but deblended, and matched to two space sources. They are also …

**Figure 2.** Magnitude distribution of the labelled COSMOS sample. Left: dividing the sample into Deblended and Non-Deblended sub-samples by ground-based detection. Right: dividing the sample into pure and unrecognized blends by matching with HST. The magnitude distributions of sub-samples with a spectrum match are shown in both panels. All samples are defined in Sect. 2.4 and Sect. 2.5. Plotted magnitude is the der…

**Figure 3.** One example of the SOMs used in this work. The SOM is trained with the training sample described in Sect. 3.1. After training, both the identification sample and the training sample are mapped/re-mapped onto the SOM to identify blending cells. Left: The 𝑟 − 𝑖+ color of the 10-d SOM weight vectors. The weight vectors are scaled back to match the original colors of the training sample for illustration. R…

**Figure 5.** [PITH_FULL_IMAGE:figures/full_fig_p008_5.png]

**Figure 6.** Main results of this work. Removing potential blended sources from the validation sample using different algorithms and quantifying the performance in terms of Recall, the fraction of blends removed, versus Cost, the fraction of all detected sources removed, both defined in Eq. D1. SOM results are shown in green squares, RF in orange circles, LOF in purple triangles, and k-NN in hollow black squares. We…

**Figure 7.** Distribution of 𝑖-mag with varying Recall values. Left: Change in 𝑖-mag distribution of the total remaining sample. Right: Change in blend fraction per 𝑖-mag bin. The distributions correspond to a range of Recall values between 0 (blue) and 1 (red) with a step size of 0.05. 50% Recall is shown as a bold purple dashed line and the original sample distribution is shown as a solid black line. Results for …

**Figure 8.** Detecting “all” and “strong” blends as defined in Sect. 4.2. Both SOM and RF perform the best using all blends to predict strong blends (All-Strong). The distance cut (Sect. 3.1) is an alternative method for identifying blends with SOM rather than the fiducial “blending cell” method. This method does not have an AA/AS/SS distinction since it only needs the pure training information coded into the SOM. The AA…

**Figure 10.** Summary statistics on removing photo-z outliers by removing blends. The fraction of photo-z outliers removed from all photo-z outliers (labeled as PZ-out. recall) is plotted against the cost to all detections on the left and the fraction of blends correctly labeled as blends (blend recall) on the right. The photo-zs are generated using a 10 feature SOM as outlined in Sect. 3.5. The SOM results are s…

**Figure 11.** Correlation matrix of scoreRF from RF, the SOM distance, and all input features. Dark blue is a correlation of -1 and dark red is a correlation of 1. [PITH_FULL_IMAGE:figures/full_fig_p014_11.png]

**Figure 12.** Select configurations of training features for RF to highlight the importance of color information. In black we include the performance of RF when trained on all features available as a benchmark. While 𝑖-magnitude and size measurements are very important as shown later in Sect. 4.5, they perform worse than using all optical and NIR colors. Excluding the 𝑖-magnitude but including all color informatio…

**Figure 13.** Importance of features ranked by decrease in performance of RF to identify unrecognized blends at 10% Cost of total sample. The horizontal axis names features in increasing importance from left to right and the vertical axis displays the recall at a 10% Cost to the sample when that feature is removed. The loss in recall by removing a feature is displayed in subsequent runs with the color-coded rectangl…

**Figure 14.** Comparison of logarithmic magnitude and asinh luptitude in the COSMOS 𝑖+ band. The magnitude uncertainties are calculated from median fluxes and flux errors in each SNR bin. Note that there is no difference between the two magnitudes at 𝑖+ < 24.5.

Adjacent body text: (galaxies and quasars), which makes up 2% of the final data set. We keep sources with NPIXELS>0 and ZWARN=0 and remove stars (SPECTYPE=‘STAR’). This y…

**Figure 15.** Comparison of RF and SOM against a variety of anomaly detection methods.

Adjacent body text (D. Classification Metrics): Metrics that are relevant for classification include Cost, Recall, Precision, and Remain, which are defined as follows. Let 𝐵 be the total number of unrecognized blends in the validation sample and 𝑃 the total number of pure galaxies. Then 𝐵 + 𝑃 is the validation sample size. Let 𝑅𝑏 be the number of rem…

**Figure 16.** Removing potential blended sources from the validation sample using different algorithms and quantifying the performance in terms of Cost, Recall, Precision, and Remaining Fraction, all of which are defined in Eq. D1. The SOM (blending cell) results are shown in green squares, RF in orange circles, LOF in purple triangles, and k-NN in hollow black squares. We include a baseline curve in gray representin…

**Figure 17.** Alternative configurations for SOM (left) and RF (right). In all cases, we see no apparent changes in performance. [PITH_FULL_IMAGE:figures/full_fig_p022_17.png]
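The Figure 6 caption describes Recall as the fraction of blends removed and Cost as the fraction of all detected sources removed. A minimal sketch of that bookkeeping follows; the function and variable names are hypothetical, and the paper's Eq. D1 is not reproduced on this page, so this follows only the verbal definitions.

```python
def recall_and_cost(is_blend, is_removed):
    """Recall = removed blends / all blends; Cost = removed objects / all objects."""
    if len(is_blend) != len(is_removed):
        raise ValueError("label and removal lists must have the same length")
    n_total = len(is_blend)
    n_blends = sum(1 for b in is_blend if b)
    if n_total == 0 or n_blends == 0:
        raise ValueError("need at least one object and one blend")
    n_removed = sum(1 for r in is_removed if r)
    n_blends_removed = sum(1 for b, r in zip(is_blend, is_removed) if b and r)
    return n_blends_removed / n_blends, n_removed / n_total


# Toy sample: 10 detections, 2 of them blends; 3 objects removed, 1 of them a blend.
is_blend = [True, True] + [False] * 8
is_removed = [True, False, True, True] + [False] * 6
recall, cost = recall_and_cost(is_blend, is_removed)
print(recall, cost)  # prints "0.5 0.3"
```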

Original abstract

In deep, ground-based imaging, about 15%-30% of object detections are expected to correspond to two or more true objects - these are called ``unrecognized blends''. We use Machine Learning algorithms to detect unrecognized blends in deep ground-based photometry using only catalog-level information: colors, magnitude, and size. We compare the performance of Self Organizing Map, Random Forest, k-Nearest Neighbors, and Anomaly Detection algorithms. We test all algorithms on 9-band ($uBVri^{+}z^{++}YJH$) and 1-size (flux_radius in $\textit{i}$-band) measurements of the ground-based COSMOS catalog, and use COSMOS HST data as the truth for unrecognized blend. We find that 17% of objects in the ground-based COSMOS catalog are unrecognized blends. We show that some unrecognized blends can be identified as such using only catalog-level information; but not all blends can be easily identified. Nonetheless, our methods can be used to improve sample purity, and can identify approximately 30% to 80% of unrecognized blends while rejecting 10% to 50% of all detected galaxies (blended or unblended). The results are similar when only optical bands ($uBVri^{+}z^{++}$) and the size information is available. We also investigate the ability of these algorithms to remove photo-z outliers (identified with spectroscopic redshifts), and find that algorithms targeting color outliers perform better than algorithms targeting unrecognized blends. Our method can offer a cleaner galaxy sample with lower blending rates for future cosmological surveys such as the Legacy Survey of Space and Time (LSST), and can potentially improve the accuracy on cosmological parameter constraints at a moderate cost of precision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Catalog ML recovers 30-80% of unrecognized blends in COSMOS at modest rejection cost, but the numbers rest on HST labels treated as error-free truth with no domain-shift test.

The core result is that random forest, kNN, SOM, and anomaly detection on 9-band colors plus i-band size can flag 30-80% of the unrecognized blends while rejecting 10-50% of the catalog, and the same models also catch some photo-z outliers. They report 17% of the ground-based COSMOS objects are blends according to HST. That is a practical, incremental data point for LSST catalog cleaning work.

The paper does a straightforward job of running the four algorithms on the same features, showing that optical-only plus size performs similarly to the full 9-band set, and checking the photo-z side benefit. Those comparisons are useful and the numbers are concrete enough to be worth citing in survey planning discussions.

The main soft spot is the label source. HST is used as definitive truth for what counts as an unrecognized blend, yet the abstract gives no error rate on that labeling step, no cross-check against independent deblending, and no test on a second field. Because training and testing stay inside COSMOS, the reported recovery rates do not address how well the classifiers would hold up at different depths or seeing conditions. If label noise is comparable to the signal the models exploit, the purity gains become less reliable.

The work is aimed at people building LSST or similar survey pipelines who need catalog-level blend flags. It is honest about its scope and the methods are simple enough to reproduce. I would send it to peer review; the topic is timely and the concrete performance numbers give referees something to evaluate even if the validation needs more work.

Referee Report

3 major / 2 minor

Summary. The paper claims that machine learning algorithms (Self Organizing Map, Random Forest, k-Nearest Neighbors, Anomaly Detection) applied to catalog-level colors, magnitudes, and sizes from 9-band (or optical-only) ground-based COSMOS photometry can detect unrecognized blends, using HST data as truth labels. It reports that 17% of ground-based detections are unrecognized blends and that the methods recover approximately 30% to 80% of them while rejecting 10% to 50% of all detected galaxies, with potential benefits for sample purity and photo-z outlier removal in surveys like LSST.

Significance. If the central performance claims hold after validation, the catalog-only approach would provide a practical, low-overhead tool for reducing blending-induced biases in cosmological analyses from wide-field ground-based data. The use of external HST truth labels is a methodological strength that avoids circularity in the detection task itself.

major comments (3)

[Abstract and results sections] Abstract and results: the quoted recovery rates (30% to 80% blend identification at 10% to 50% rejection) are presented without error bars, cross-validation statistics, train/test split details, or explicit methodology for computing the fractions, which directly undermines assessment of the central performance claim.
[Data and validation sections] HST labeling procedure: the manuscript treats COSMOS HST deblending as definitive ground truth for unrecognized blends without any quantified error rate, agreement metric with ground-based footprints, or sensitivity test to HST depth/seeing differences; this assumption is load-bearing for all reported metrics.
[Discussion and conclusions] Generalization to LSST: no domain-shift experiments are described (all training/testing occurs within the single COSMOS field), yet the conclusion emphasizes applicability to other surveys and depths; this leaves the primary claimed use case untested.

minor comments (2)

[Abstract] The statement that '17% of objects in the ground-based COSMOS catalog are unrecognized blends' lacks a reference to the exact selection or counting procedure used to obtain this fraction.
[Methods] Clarify whether the anomaly detection and SOM implementations use the same feature normalization and hyperparameter choices as the supervised methods, to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful and constructive comments on our manuscript. We have reviewed each major point carefully and provide point-by-point responses below, indicating revisions where the manuscript will be updated.

Referee: [Abstract and results sections] Abstract and results: the quoted recovery rates (30% to 80% blend identification at 10% to 50% rejection) are presented without error bars, cross-validation statistics, train/test split details, or explicit methodology for computing the fractions, which directly undermines assessment of the central performance claim.

Authors: We agree that additional statistical details are needed for a full assessment of the performance claims. The original manuscript describes the algorithms and overall approach but does not prominently report error bars, explicit train/test split ratios, or the precise fraction computation method. In the revised manuscript we will add error bars computed via 5-fold cross-validation, specify the data partitioning procedure, and include a methods subsection detailing how the recovery and rejection fractions are calculated from the confusion matrices. These changes will be reflected in both the abstract and results sections.

revision: yes

Referee: [Data and validation sections] HST labeling procedure: the manuscript treats COSMOS HST deblending as definitive ground truth for unrecognized blends without any quantified error rate, agreement metric with ground-based footprints, or sensitivity test to HST depth/seeing differences; this assumption is load-bearing for all reported metrics.

Authors: The referee correctly notes that the HST-based labels are treated as ground truth without quantified uncertainty. While HST resolution is the standard reference for blend identification in COSMOS, we did not include an error-rate estimate or sensitivity tests in the submitted version. We will revise the data and validation sections to add a dedicated paragraph discussing the limitations of this assumption, citing existing literature on HST deblending completeness where available, and performing a limited sensitivity check by varying the HST magnitude limit used for truth labels. Full external validation of HST label errors would require additional datasets beyond the scope of the current work.

revision: partial

Referee: [Discussion and conclusions] Generalization to LSST: no domain-shift experiments are described (all training/testing occurs within the single COSMOS field), yet the conclusion emphasizes applicability to other surveys and depths; this leaves the primary claimed use case untested.

Authors: We acknowledge that all experiments were confined to the COSMOS field, as it is the only publicly available dataset combining the required deep ground-based multi-band photometry with high-resolution HST imaging for truth labels. No domain-shift tests across fields or depths were performed. In the revised discussion and conclusions we will temper the language on LSST applicability, explicitly state that COSMOS serves as a proof-of-concept with comparable optical depths, and recommend future validation on independent fields. This revision clarifies the current scope without overstating generalizability.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical ML performance against external HST labels

The paper trains and evaluates standard supervised and unsupervised ML algorithms (Self Organizing Map, Random Forest, k-Nearest Neighbors, anomaly detection) on ground-based COSMOS catalog features (colors, magnitude, size) with COSMOS HST data serving as independent truth labels for unrecognized blends. Reported metrics (17% blend fraction; 30-80% recovery at 10-50% rejection) are direct empirical comparisons of model outputs to these external labels, not reductions of any equation or fitted parameter to its own inputs by construction. No self-definitional relations, fitted-input-as-prediction steps, load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the methodology. The derivation chain is self-contained as a data-driven validation exercise against an external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no access to methods section, so free parameters, axioms, and invented entities cannot be enumerated from the provided text.

pith-pipeline@v0.9.0 · 5854 in / 1019 out tokens · 23171 ms · 2026-05-25T08:17:21.009419+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

[1]

0 < mag_best < 26.5. We keep sources that are up to 2 mags fainter than the base COSMOS sample (ip_MAG_AUTO < 24.5) to allow for blending with fainter sources.

[2]

0 < flux_radius < 10/0.049. We remove sources with negative flux radii, as well as diffuse sources larger than 10 arcsec. The large diffuse sources would contaminate the matching process for blends if not removed. The flux radius is in units of pixels, and the HST ACS pixel scale (footnote 6) is 0.049 arcsec/pix.

[3]

mu_class = 1 or 2. The mu_class parameter is a morphology-based star-galaxy separation flag derived from the MU_MAX and MAG_AUTO parameters from SExtractor. We keep galaxies (mu_class = 1) and point sources (mu_class = 2), and remove “fake” detections with mu_class = 3.

A.2. Ground-Based COSMOS Data

We apply the following selections on the ground-based CO...

[4]

FLAG_COSMOS=1 & FLAG_HJMCC=0 & FLAG_PETER=0. Sources that satisfy these cuts are kept. The first two flags mark sources within the COSMOS area that also have UltraVISTA coverage. The last flag removes saturated sources and masked areas in optical bands.

[5]

b_FLAGS < 4. Photometry flags from SExtractor. We keep sources with b_FLAGS = 0, 1 and 2, which correspond to isolated sources, sources with bright neighbors, and recognized blends, respectively. Here b stands for all of the uBVr𝑖+𝑧++YJH bands; sources that satisfy this cut in all bands are kept.

[6]

b_IMAFLAGS_ISO = 0. Only unsaturated sources in all bands are kept.

[7]

b_FLUXERR > -99. We remove non-observations in any bands. There are only 8 of them.

[8]

FLUX_RADIUS > 0. We remove sources with negative flux radii.

[9]

ip_FLUX_APER3/FLUX_RADIUS² < 0.002. Sources with extremely low surface brightness are removed.

[10]

TYPE = 0. Morphology classification based on NIR or 3.6 µm observation. Only galaxies are kept.

[11]

ip_MAG_AUTO < 24.5. This magnitude cut at 𝑖+ < 24.5 is limited by the depth of the “truth” catalog, the HST COSMOS catalog. We allow galaxies to blend with sources up to 2 magnitudes fainter, and the HST COSMOS catalog is complete at ∼ 26.5.

[12]

ZPDF > 0. The COSMOS 30 band photo-z estimated with galaxy SED templates, measured at the median of the likelihood distributions. Only sources with positive photo-z measurement are used for training.

Footnote 6: https://hst-docs.stsci.edu/acsdhb/chapter-1-acs-overview/1-1-instrument-design-and-capabilities

A.3. Spectroscopy Data

We assemble a spectroscopy sample from...

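The catalog selections listed in items [1] to [12] can be sketched as two row filters. This is a minimal illustration, assuming each catalog row is a dict; the per-band column prefixes other than `ip` are hypothetical, and the surface-brightness cut in item [9] is left out because the listing does not make clear whether the inequality marks kept or removed sources.

```python
HST_PIXEL_SCALE = 0.049                       # arcsec per pixel, as quoted in item [2]
MAX_RADIUS_PIX = 10 / HST_PIXEL_SCALE         # 10 arcsec expressed in pixels (~204)
BANDS = ["u", "B", "V", "r", "ip", "zpp", "Y", "J", "H"]  # assumed column prefixes


def keep_hst(row):
    """HST catalog selections from items [1] to [3]."""
    return (0 < row["mag_best"] < 26.5
            and 0 < row["flux_radius"] < MAX_RADIUS_PIX
            and row["mu_class"] in (1, 2))


def keep_ground(row):
    """Ground-based selections from items [4] to [8] and [10] to [12]."""
    per_band_ok = all(
        row[f"{band}_FLAGS"] < 4
        and row[f"{band}_IMAFLAGS_ISO"] == 0
        and row[f"{band}_FLUXERR"] > -99
        for band in BANDS
    )
    return (per_band_ok
            and row["FLAG_COSMOS"] == 1
            and row["FLAG_HJMCC"] == 0
            and row["FLAG_PETER"] == 0
            and row["FLUX_RADIUS"] > 0
            and row["TYPE"] == 0
            and row["ip_MAG_AUTO"] < 24.5
            and row["ZPDF"] > 0)


# Hypothetical rows that pass every cut.
hst_row = {"mag_best": 25.0, "flux_radius": 4.0, "mu_class": 1}
ground_row = {"FLAG_COSMOS": 1, "FLAG_HJMCC": 0, "FLAG_PETER": 0,
              "FLUX_RADIUS": 3.0, "TYPE": 0, "ip_MAG_AUTO": 23.9, "ZPDF": 0.8}
for band in BANDS:
    ground_row[f"{band}_FLAGS"] = 2
    ground_row[f"{band}_IMAFLAGS_ISO"] = 0
    ground_row[f"{band}_FLUXERR"] = 0.05

print(keep_hst(hst_row), keep_ground(ground_row))  # prints "True True"
```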
[13]

zCOSMOS (Lilly et al. 2009; Knobel et al. 2012): zCOSMOS is a large survey in the COSMOS field conducted with the VIMOS spectrograph on the Very Large Telescope (VLT). The survey is designed to characterize the galaxy environments and produce diagnostic information on the galaxies and active galactic nuclei. We download the zCOSMOS DR3 catalog which contains 20689...

[14]

DEIMOS (Hasinger et al. 2018; Mallery et al. 2012): The COSMOS DEIMOS Catalog consists of 10718 objects in the COSMOS field, observed through multi-slit spectroscopy with the Deep Imaging Multi-Object Spectrograph (DEIMOS) on the Keck II telescope. We keep sources with a quality flag Q = 2 or 1.5, representing reliable spectroscopic identification or ...

[15]

C3R2 (Masters et al. 2017, 2019; Stanford et al. 2021; Euclid Collaboration et al. 2022): The Complete Calibration of the Color-Redshift Relation (C3R2) is a spectroscopic survey at depth 𝑖 ∼ 24.5. C3R2 aims to fill out the galaxy color space with spectroscopic redshifts, to provide a firm foundation for photometric-redshift calibration for upcoming weak l...

[16]

VUDS (Le Fèvre et al. 2015; Tasca et al. 2017): The VIMOS Ultra Deep Survey (VUDS) is a spectroscopic redshift survey of ∼ 10000 very faint galaxies to study the major phase of galaxy assembly. VUDS covers 3 separate fields: COSMOS, ECDFS and VVDS-02h, providing an additional 384 sources in the COSMOS field. We keep only 144 sources with zflags = 3 or 4, which have modera...

[17]

DESI EDR (DESI Collaboration et al. 2023): The Early Data Release of the Dark Energy Spectroscopic Instrument (DESI) contains 1.2 million extra-galactic sources

Footnote 7: https://irsa.ipac.caltech.edu/data/COSMOS/spectra/z-cosmos/zCOSMOS_DR3.pdf

Fig. 14. Comparison of logarithmic magnitude and asinh luptitude in the COSMOS 𝑖+ band. The magnitude uncertaint...

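For readers unfamiliar with the two magnitude systems compared in Fig. 14, the sketch below uses the standard asinh magnitude definition of Lupton, Gunn and Szalay (1999). The softening value `b` and the flux ratios are hypothetical placeholders, not the COSMOS values; the point is only that the two systems agree for bright sources and that the asinh form stays finite at zero or negative flux.

```python
import math

POGSON = 2.5 / math.log(10)  # converts natural-log units to magnitudes


def log_magnitude(flux_ratio):
    """Classical logarithmic magnitude; undefined for flux_ratio <= 0."""
    return -2.5 * math.log10(flux_ratio)


def asinh_magnitude(flux_ratio, b):
    """Asinh magnitude ("luptitude") with softening parameter b."""
    return -POGSON * (math.asinh(flux_ratio / (2.0 * b)) + math.log(b))


b = 1e-11                 # hypothetical softening parameter
bright = 1e-8             # flux ratio far above b
print(round(log_magnitude(bright), 4), round(asinh_magnitude(bright, b), 4))
print(round(asinh_magnitude(0.0, b), 4))  # finite value at zero flux
```

With these placeholder numbers the bright source gives 20.0 in both systems, and the zero-flux luptitude equals -2.5 log10(b) = 27.5.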
[18]

Elliptical Envelope - EE is an unsupervised algorithm that identifies outliers with a Mahalanobis distance greater than some threshold (Rousseeuw 1984, 1985). The Mahalanobis distance is a measure of how many standard deviations a datum is from a distribution: $d_{\mathrm{EE}} = \sqrt{(\vec{x} - \vec{\mu})^{T} C^{-1} (\vec{x} - \vec{\mu})}$. While EE is unsupervised, we use the pure galaxies to estim...

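The Mahalanobis distance above can be illustrated for two features with nothing beyond the standard library. This sketch uses the plain sample mean and covariance of a reference sample, which is an assumption made for brevity; the cited Rousseeuw papers concern robust estimators, and the paper's actual fitting procedure is not shown on this page.

```python
import math


def mean_and_cov_2d(points):
    """Sample mean and covariance matrix of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis_2d(x, mu, cov):
    """sqrt((x - mu)^T C^-1 (x - mu)) using the closed-form 2x2 inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(quad)


# With variance 4 along the first axis, a point 2 units away is 1 sigma out.
print(mahalanobis_2d((2.0, 0.0), (0.0, 0.0), ((4.0, 0.0), (0.0, 1.0))))  # prints 1.0
```

An object would then be flagged when its distance exceeds a chosen threshold.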
[19]

Local Outlier Factor - LOF is an unsupervised algorithm that detects outliers by calculating the local density compared to the nearest neighbors for a data point; it was first described in Breunig et al. (2000). Any point that has a low density compared to its nearest neighbors is labelled as an outlier. The density in comparison to its neighbors is turne...

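A compact, standard-library sketch of the Local Outlier Factor as defined by Breunig et al. (2000): k-distance, reachability distance, local reachability density, and the density ratio. It is a brute-force illustration on toy 2-D points, not the implementation used in the paper, and it assumes no more than k duplicate points.

```python
import math


def local_outlier_factors(points, k):
    """Return the LOF score of every point (values well above 1 suggest outliers)."""
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    neighbors, k_dist = [], []
    for i, p in enumerate(points):
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append([j for _, j in ranked[:k]])  # the k nearest neighbors
        k_dist.append(ranked[k - 1][0])               # distance to the k-th neighbor

    def reach_dist(i, j):
        # Reachability distance of point i with respect to neighbor j.
        return max(k_dist[j], math.dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, j) for j in neighbors[i]) for i in range(n)]
    # LOF: mean density of the neighbors divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# A 3x3 grid of inliers plus one isolated point.
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
scores = local_outlier_factors(grid + [(10.0, 10.0)], k=3)
print(scores[-1] > 2.0, max(scores[:-1]) < 1.5)  # prints "True True"
```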
[20]

Isolation Forest - “iForest” is first described in Liu et al. (2008). iForest is created by creating random splits between the minimum and maximum of a feature and then counting the number of splits it takes to uniquely label a datum, the path length. iForest operates by assuming that outliers will have a shorter path length to isolate them into a terminal node with no oth...

[21]

One Class Support Vector Machine - OneClassSVM is described in Schölkopf et al. (2001). This is an unsupervised outlier detector that attempts to create a bounded region (hypersphere) in parameter space that encloses the majority of points which are thought to be inliers. One advantage with this method is the ability to specify a kernel according to any underlying geo...

[22]

10+10, Euclidean: using Euclidean distance instead of 𝜒² distance (equation [4]) for training the SOM and for mapping objects onto the SOM.

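To illustrate why the choice between Euclidean and 𝜒² distance can matter when mapping an object onto a SOM, the sketch below compares the two on toy weight vectors. The paper's equation [4] is not reproduced on this page, so the uncertainty-weighted form used here (squared differences divided by squared feature uncertainties) is an assumption, as are all names and numbers.

```python
def euclidean_sq(x, w):
    """Squared Euclidean distance between an object and a cell weight vector."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))


def chi2_distance(x, sigma, w):
    """Assumed chi-square form: differences scaled by the feature uncertainties."""
    return sum(((xi - wi) / si) ** 2 for xi, si, wi in zip(x, sigma, w))


def best_matching_cell(x, weights, sigma=None):
    """Index of the closest cell; uses chi-square distance when sigma is given."""
    if sigma is None:
        distances = [euclidean_sq(x, w) for w in weights]
    else:
        distances = [chi2_distance(x, sigma, w) for w in weights]
    return distances.index(min(distances))


obj = (0.0, 0.0)
cells = [(0.5, 0.0), (0.0, 0.2)]   # two toy SOM cells
sigma = (1.0, 0.1)                 # second feature is measured much more precisely
print(best_matching_cell(obj, cells), best_matching_cell(obj, cells, sigma))  # prints "1 0"
```

In this toy case the precisely measured second feature rules out cell 1 under the weighted distance, even though cell 1 is closer in the unweighted sense.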
[23]

10+10, Blends: using both the pure training sample and the identification sample to train the SOM.

[24]

10+10, Counts: removing cells based on the counts of blends in the cell instead of the ratio of blends to pure galaxies in the cell.

Alternative Configurations for RF:

[25]

10+10, Regression: using 10 features for training and testing and treating the labels as integers (0 for pure and 1 for blends). Create a regression forest that outputs a score between 0 and 1. The main text uses a classification forest which is then turned into a score by counting the number of trees voting for “pure” as outlined in Sect. 3.2. This method directly ...

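The vote-counting step described above can be sketched as follows. Normalizing the count to a fraction, and then flagging the lowest-scoring objects until a target Cost is reached, are assumptions made for illustration; the page does not show Sect. 3.2, and the function names are hypothetical.

```python
def pure_vote_score(tree_votes):
    """Fraction of trees that voted "pure" for one object."""
    if not tree_votes:
        raise ValueError("need at least one tree vote")
    return sum(1 for vote in tree_votes if vote == "pure") / len(tree_votes)


def flag_lowest_scores(scores, cost):
    """Flag the `cost` fraction of objects with the lowest pure-vote score."""
    n_remove = round(cost * len(scores))
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    flagged = set(order[:n_remove])
    return [i in flagged for i in range(len(scores))]


# Toy votes from a 10-tree forest using the three labels named in the text.
votes = [
    ["pure"] * 8 + ["weak"] * 2,
    ["pure"] * 2 + ["strong"] * 8,
    ["pure"] * 10,
    ["pure"] * 5 + ["weak"] * 3 + ["strong"] * 2,
]
scores = [pure_vote_score(v) for v in votes]
flags = flag_lowest_scores(scores, cost=0.5)
print(scores, flags)  # prints "[0.8, 0.2, 1.0, 0.5] [False, True, False, True]"
```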
[26]

10+10, 2-class: using 10 features for training and testing on a classification forest with the labels “pure” and “blend.” The main text gives three labels: “pure”, “weak”, “strong.”

[27]

19+19, training with photometric uncertainties: using 19 features for training and testing with the features being 8 colors, 8 color errors, 1 magnitude, 1 magnitude error, and flux radius.

These alternative configurations are displayed in Fig. 17. We found no evident change in performance in any cases.

This paper was built using the Open Journal of Astrophysics LaTeX t...

[1] [1]

We keep sources that are up to 2 mags fainter than the base COSMOS sample (ip_MAG_AUTO < 24.5 ) to al- low for blending with fainter sources

0 < mag_best < 26.5 . We keep sources that are up to 2 mags fainter than the base COSMOS sample (ip_MAG_AUTO < 24.5 ) to al- low for blending with fainter sources

work page

[2] [2]

The large dif- fuse sources would contaminate the matching process 19 for blends if not removed

0 < flux_radius < 10/0.049 We remove sources with negative flux radii, as well as diffuse sources larger than 10 arcsecs. The large dif- fuse sources would contaminate the matching process 19 for blends if not removed. The flux radius is in units of pixels,andtheHSTACSpixelscale 6is0.049arcsec/pix

work page

[3] [3]

The mu_class parameter is a morphology-based star- galaxy separation flag derived from the MU_MAX and MAG_AUTO parameters from SExtractor

mu_class = 1 or 2 . The mu_class parameter is a morphology-based star- galaxy separation flag derived from the MU_MAX and MAG_AUTO parameters from SExtractor. We keep galaxies( mu_class = 1 )andpointsources( mu_class = 2 ), and remove “fake” detections withmu_class = 3. A.2. Ground-Based COSMOS Data We apply the following selections on the ground-based CO...

work page

[4] [4]

Sourcesthatsatisfythesecutsarekept

FLAG_COSMOS=1& FLAG_HJMCC=0& FLAG_PETER=0. Sourcesthatsatisfythesecutsarekept. Thefirsttwoflags mark sources within the COSMOS area that also have UltraVISTA coverage. The last flag removes saturated sources and masked areas in optical bands

work page

[5] [5]

Photometry flags fromSExtractor

b_FLAGS < 4 . Photometry flags fromSExtractor. We keep sources with b_FLAGS = 0, 1 and 2 , which corresponds to isolatedsources,sourceswithbrightneighbors,andrec- ognized blends, respectively. Herebstands for all of the uBVr𝑖+𝑧++YJH bands; sources that satisfy this cut in all bands are kept

[6]

b_IMAFLAGS_ISO = 0. Only unsaturated sources in all bands are kept.

[7]

b_FLUXERR > -99. We remove non-observations in any band. There are only 8 of them.

[8]

FLUX_RADIUS > 0. We remove sources with negative flux radii.

[9]

ip_FLUX_APER3/FLUX_RADIUS^2 < 0.002. Sources with extremely low surface brightness are removed.

[10]

TYPE = 0. Morphology classification based on NIR or 3.6 µm observation. Only galaxies are kept.

[11]

ip_MAG_AUTO < 24.5. This magnitude cut at 𝑖+ < 24.5 is limited by the depth of the “truth” catalog, the HST COSMOS catalog. We allow galaxies to blend with sources up to 2 magnitudes fainter, and the HST COSMOS catalog is complete at ∼26.5.

[12]

ZPDF > 0. The COSMOS 30-band photo-z estimated with galaxy SED templates, measured at the median of the likelihood distributions. Only sources with positive photo-z measurements are used for training.

Footnote 6: https://hst-docs.stsci.edu/acsdhb/chapter-1-acs-overview/1-1-instrument-design-and-capabilities

A.3. Spectroscopy Data

We assemble a spectroscopy sample from...

[13]

zCOSMOS (Lilly et al. 2009; Knobel et al. 2012): zCOSMOS is a large survey in the COSMOS field conducted with the VIMOS spectrograph on the Very Large Telescope (VLT). The survey is designed to characterize the galaxy environments and produce diagnostic information on the galaxies and active galactic nuclei. We download the zCOSMOS DR3 catalog, which contains 20689...

[14]

DEIMOS (Hasinger et al. 2018; Mallery et al. 2012): The COSMOS DEIMOS Catalog consists of 10718 objects in the COSMOS field, observed through multi-slit spectroscopy with the Deep Imaging Multi-Object Spectrograph (DEIMOS) on the Keck II telescope. We keep sources with a quality flag Q = 2 or 1.5, representing reliable spectroscopic identification or ...

[15]

C3R2 (Masters et al. 2017, 2019; Stanford et al. 2021; Euclid Collaboration et al. 2022): The Complete Calibration of the Color-Redshift Relation (C3R2) is a spectroscopic survey at depth 𝑖 ∼ 24.5. C3R2 aims to fill out the galaxy color space with spectroscopic redshifts, to provide a firm foundation for photometric-redshift calibration for upcoming weak l...

[16]

VUDS (Le Fèvre et al. 2015; Tasca et al. 2017): The VIMOS Ultra Deep Survey (VUDS) is a spectroscopic redshift survey of ∼10000 very faint galaxies to study the major phase of galaxy assembly. VUDS covers 3 separate fields: COSMOS, ECDFS and VVDS-02h, providing an additional 384 sources in the COSMOS field. We keep only 144 sources with zflags = 3 or 4, which have modera...

[17]

softening parameter

DESI EDR (DESI Collaboration et al. 2023): The Early Data Release of the Dark Energy Spectroscopic Instrument (DESI) contains 1.2 million extra-galactic sources

Footnote 7: https://irsa.ipac.caltech.edu/data/COSMOS/spectra/z-cosmos/zCOSMOS_DR3.pdf

Fig. 14. Comparison of logarithmic magnitude and asinh luptitude in the COSMOS 𝑖+ band. The magnitude uncertaint...

[18]

Elliptical Envelope - EE is an unsupervised algorithm that identifies outliers with a Mahalanobis distance greater than some threshold (Rousseeuw 1984, 1985). The Mahalanobis distance is a measure of how many standard deviations a datum is from a distribution: $d_{\mathrm{EE}} = \sqrt{(\vec{x} - \vec{\mu})^{T} C^{-1} (\vec{x} - \vec{\mu})}$. While EE is unsupervised, we use the pure galaxies to estim...
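As a worked illustration of the Mahalanobis distance above, the sketch below evaluates it with NumPy for a mean vector and covariance matrix estimated from a synthetic sample. The Gaussian sample stands in for the pure-galaxy features and is an assumption, as is the helper name `mahalanobis_distance`; no real catalog columns are used.

```python
import numpy as np


def mahalanobis_distance(x, mu, cov):
    """Mahalanobis distance of each row of x from a distribution (mu, cov)."""
    diff = np.atleast_2d(x) - mu
    cov_inv = np.linalg.inv(cov)
    # Row-wise quadratic form: diff_i^T C^{-1} diff_i
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(d2)


# Synthetic stand-in for the pure-galaxy feature vectors (assumption).
rng = np.random.default_rng(42)
pure = rng.normal(size=(1000, 3))
mu = pure.mean(axis=0)
cov = np.cov(pure, rowvar=False)

d_center = mahalanobis_distance(mu, mu, cov)       # a point at the mean
d_far = mahalanobis_distance(mu + 5.0, mu, cov)    # a point far from the mean

# With an identity covariance the distance reduces to the Euclidean norm.
d_unit = mahalanobis_distance(np.array([3.0, 4.0]), np.zeros(2), np.eye(2))
```

An object is flagged as an outlier when its distance exceeds a chosen threshold, so `d_far` would be flagged long before `d_center`.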

[19]

Local Outlier Factor - LOF is an unsupervised algorithm, first described in Breunig et al. (2000), that detects outliers by comparing the local density of a data point with that of its nearest neighbors. Any point that has a low density compared to its nearest neighbors is labelled as an outlier. The density in comparison to its neighbors is turne...

[20]

Isolation Forest - “iForest” is first described in Liu et al. (2008). iForest is built by making random splits between the minimum and maximum of a feature and then counting the number of splits it takes to uniquely label a datum, the path length. iForest operates by assuming that outliers will have a shorter path length to isolate them into a terminal node with no oth...

[21]

One Class Support Vector Machine - OneClassSVM is described in Schölkopf et al. (2001). This is an unsupervised outlier detector that attempts to create a bounded region (hypersphere) in parameter space that encloses the majority of points which are thought to be inliers. One advantage with this method is the ability to specify a kernel according to any underlying geo...
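The four anomaly detectors described in entries [18] to [21] all have implementations in scikit-learn, and the sketch below shows how they could be fit on a pure sample and applied to new objects. Whether the authors used scikit-learn is an assumption, as are the hyperparameters, the synthetic Gaussian data, and the choice to fit every detector (not only EE) on the pure sample.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_pure = rng.normal(0.0, 1.0, size=(500, 4))  # stand-in for pure galaxies
X_test = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 4)),       # pure-like objects
    rng.normal(6.0, 1.0, size=(50, 4)),       # distant objects acting as outliers
])

# Hyperparameters are illustrative defaults, not values from the paper.
detectors = {
    "EE": EllipticEnvelope(random_state=0),
    "LOF": LocalOutlierFactor(n_neighbors=20, novelty=True),
    "iForest": IsolationForest(random_state=0),
    "OneClassSVM": OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
}

flags = {}
for name, det in detectors.items():
    det.fit(X_pure)
    flags[name] = det.predict(X_test)  # +1 for inliers, -1 for outliers
```

`LocalOutlierFactor` needs `novelty=True` to score objects that were not part of the fitting sample, which is the situation when a detector trained on pure galaxies is applied to a separate identification sample.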

[22]

10+10, Euclidean: using Euclidean distance instead of 𝜒² distance (equation [4]) for training the SOM and for mapping objects onto the SOM.

[23]

10+10, Blends: using both the pure training sample and the identification sample to train the SOM

[24]

10+10, Counts: removing cells based on the counts of blends in the cell instead of the ratio of blends to pure galaxies in the cell.

Alternative Configurations for RF:

[25]

10+10, Regression: using 10 features for training and testing and treating the labels as integers (0 for pure and 1 for blends). This creates a regression forest that outputs a score between 0 and 1. The main text uses a classification forest, which is then turned into a score by counting the number of trees voting for “pure” as outlined in Sect. 3.2. This method directly ...

[26]

10+10, 2-class: using 10 features for training and testing on a classification forest with the labels “pure” and “blend.” The main text gives three labels: “pure”, “weak”, “strong.”
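The two scoring schemes from the “Regression” and “2-class” configurations can be compared in a short scikit-learn sketch: a two-label classification forest scored by the fraction of trees voting “pure”, and a regression forest trained on integer labels. The synthetic 10-feature data, the forest sizes, and the use of scikit-learn are all assumptions made for illustration; this is not the authors' configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(1)
n_features = 10  # mirrors the feature count of the "10+10" configurations


def make_sample(n_pure, n_blend):
    """Synthetic features with labels 0 (pure) and 1 (blend); illustration only."""
    X = np.vstack([
        rng.normal(0.0, 1.0, size=(n_pure, n_features)),
        rng.normal(0.7, 1.3, size=(n_blend, n_features)),
    ])
    y = np.concatenate([np.zeros(n_pure, dtype=int), np.ones(n_blend, dtype=int)])
    return X, y


X_train, y_train = make_sample(800, 200)
X_test, y_test = make_sample(200, 50)

# Classification forest: score = fraction of trees voting "pure" (label 0).
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
tree_votes = np.stack([tree.predict(X_test) for tree in clf.estimators_])
pure_vote_fraction = (tree_votes == 0).mean(axis=0)

# Regression forest: integer labels are averaged into a score between 0 and 1.
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
blend_score = reg.predict(X_test)
```

Both scores are bounded by 0 and 1, so a single threshold on either one trades the fraction of blends removed against the fraction of the full sample rejected.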

[27]

19+19, training with photometric uncertainties: using 19 features for training and testing, with the features being 8 colors, 8 color errors, 1 magnitude, 1 magnitude error, and flux radius. These alternative configurations are displayed in Fig. 17. We found no evident change in performance in any case. This paper was built using the Open Journal of Astrophysics LaTeX t...
