pith. sign in

arxiv: 2512.20999 · v2 · submitted 2025-12-24 · 🌌 astro-ph.IM · astro-ph.GA

DRAGNs in the Forest: Identifying Artifacts with Random Forest Models in the VLASS DRAGNs Catalog

Pith reviewed 2026-05-16 20:00 UTC · model grok-4.3

classification 🌌 astro-ph.IM astro-ph.GA
keywords random forestVLASSDRAGNsartifactsclassificationradio catalogmachine learningactive galactic nuclei
0
0 comments X

The pith

Random forest models classify VLASS DRAGNs by artifact count to enable extraction of a 99.3% complete and 97.7% artifact-free catalog.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains random forest models to predict whether each double radio source associated with an active galactic nucleus contains zero, one, two, or three imaging artifacts. The best model reaches a weighted F1 score of 97.01 percent with small uncertainties. These predictions are applied to produce a cleaned version of the VLASS DRAGN catalog. A sympathetic reader would care because radio surveys supply large samples for studying distant black holes, yet artifacts can distort counts and properties unless removed.

Core claim

The authors train random forest models to classify DRAGNs according to the number of artifacts they contain, ranging from zero to three. The optimized model attains a weighted F1 score of 97.01%^{+1.12%}_{-1.32%}. Applying these classifications produces a catalog of VLASS DRAGNs from which an estimated 99.3% complete catalog of 97.7% artifact-free sources can be extracted.

What carries the argument

Random forest classifiers trained to predict artifact multiplicity (0-3) per DRAGN using features from the VLASS Quick Look catalog.

Load-bearing premise

The training labels correctly identify the true number of artifacts in each source and the model generalizes to the full catalog without distribution shift.

What would settle it

Independent visual or higher-resolution inspection of a random sample of sources predicted to contain zero artifacts, to verify whether they are actually free of artifacts.

Figures

Figures reproduced from arXiv: 2512.20999 by Eric J. Hooper, Melissa E. Morris, Sarah Bach, Verene Einwalter, Yjan A. Gordon.

Figure 1
Figure 1. Figure 1: Collage of 1.5’x1.5’ VLASS images of triple-component DRAGNs identified by DRAGNhunter. The ellipses denote components as identified by DRAGNhunter, where the green ellipses denote the lobe or jet hot spot component, the cyan ellipse denotes the identified core, and the green X denotes the AllWISE host as identified in Y. A. Gordon et al. (2023), if one was found. (Top line) Examples of what typical DRAGN … view at source ↗
Figure 2
Figure 2. Figure 2: Examples of triple sources with dubious morphologies which are difficult to classify. The ellipses denote components as identified by DRAGNhunter, where the green ellipses denote the lobe or jet hot spot component, the cyan ellipse denotes the identified core, and the green X denotes the AllWISE host as identified in Y. A. Gordon et al. (2023), if one was found. (a) A double with a strong Y-shaped sidelobe… view at source ↗
Figure 3
Figure 3. Figure 3: Confusion matrix comparing the results of visual inspection for artifacts of this paper and spurious detections of the triples in Y. A. Gordon et al. (2023), where 1 denotes spurious and 0 denotes not spurious. Spurious sources are those that contain artifacts. The percentages and the color of each square are determined by the fraction of the total population of the row, i.e. the fraction of the true class… view at source ↗
Figure 4
Figure 4. Figure 4: (a) Scatterplot of LAS S/N vs. Flux S/N for all triples grouped by number of artifacts in each source as identified by visual inspection. This particular set of parameters shows that the 0-, 2-, and 3-artifact classes of sources cluster in approximately 3 separate areas. (b) Scatterplot of LAS S/N vs. Flux S/N for all doubles which suggests that the double sources may also cluster into artifact classes wit… view at source ↗
Figure 5
Figure 5. Figure 5: 1.5’x1.5’ VLASS cutouts of doubles identified by DRAGNhunter. The ellipses denote components as identified by DRAGNhunter, where the green ellipses denote the lobe or jet hot spot components, and the green X denotes the AllWISE host as identified in Y. A. Gordon et al. (2023), if one was found. (Top line) Examples of what typical 2-artifact doubles look like. Note the similarity between 3-artifact triples … view at source ↗
Figure 6
Figure 6. Figure 6: Confusion matrices of the classification results on the verification set (20% of sample) for each of the triples classification runs. The percentages and the color of each square are determined by the fraction of the total population of the row, i.e. the fraction of the true class, that is present in each square. classes, such as LAS S/N, flux S/N, and the prominence of the brightest component in the VLASS… view at source ↗
Figure 7
Figure 7. Figure 7: Plot of the importance of each parameter used in the model from Run 3. 1-artifact triples triples arise mainly from extended sources with one particularly bright lobe, which tends to cause artifacts. The prevalence of these bright lobes explains how some 1-artifact triples occupy the same area in LAS S/N and flux S/N space as 2-artifact sources, because their morphology is similar. Some 1-artifact sources … view at source ↗
Figure 8
Figure 8. Figure 8: Plot of the mean weighted F1 score across 25 randomly seeded runs per point for both random and log-log parameter space training set selection methods. The log-log method using LAS and Flux S/N converges towards the maximum accuracy sooner than with random selection. We performed additional runs with log-log selection varying from 1-15 bins per axis and 30-50 samples per bin to reach approximately the same… view at source ↗
Figure 9
Figure 9. Figure 9: Plots comparing the performance of models trained on sets selected by log-log LAS S/N and flux S/N parameter grid space selection across different combinations of bins per axis and maximum number of samples per bin. Because we define a uniform grid of bins, and our sources are not uniformly distributed across LAS S/N and flux S/N parameter space (as seen in [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Confusion matrix comparing the predictions of the triples-trained RF model to the results of visual inspection on the doubles verification set. 4.1.2. Log-Log LAS S/N and Flux S/N Selected Model We tested the efficacy of using log-log LAS S/N and flux S/N grid space selection to select a minimal training set that can still achieve high classification performance. Our testing of selection parameters, the n… view at source ↗
Figure 11
Figure 11. Figure 11: Confusion matrix of the predictions of the model trained on the log-log LAS S/N and flux S/N set of doubles applied to the doubles verification set. 4.2. Doubles Classification Performance and Model Comparison The weighted F1 score of the log-log model is 1.4% higher than the triples-trained model, and the log-log model produces fewer false positives, i.e. real 0-artifact sources classified as containing … view at source ↗
Figure 12
Figure 12. Figure 12: A couple of the sources that confused both the triples-trained and log-log LAS S/N and flux S/N models. The green ellipses denote components as identified by DRAGNhunter, and the green X denotes the AllWISE host as identified in Y. A. Gordon et al. (2023), if one was found. Both are unresolved point sources with prominent sidelobes and represent sources that both models misclassified as 0-artifact when th… view at source ↗
Figure 13
Figure 13. Figure 13: Confusion matrices comparing the efficacy of the DRAGNhunter Q flag and our log-log random forest model in isolating spurious DRAGNs, where non-spurious sources are labeled with a 0 and spurious sources are labeled with a 1. While the random forest model misses some artifact-containing sources filtered out by the Q flag, it retains a larger number of artifact-free sources. In [PITH_FULL_IMAGE:figures/ful… view at source ↗
Figure 14
Figure 14. Figure 14: Images of sources which were problematic for the source-filtering approaches. (a-c) A representative sample of artifact-free sources which were identified as spurious by the Q flag, but were correctly identified as zero-artifact by our random forest model. (d-e) One-artifact sources which are correctly identified as spurious by the Q flag but were identified as having zero artifacts by our random forest c… view at source ↗
read the original abstract

The Quick Look data products from the Very Large Array Sky Survey (VLASS) contain widespread imaging artifacts arising from the simplified imaging algorithm used in their production. The catalog of double radio sources associated with active galactic nuclei (DRAGNs) found in the VLASS first epoch Quick Look release using the DRAGNhunter algorithm suffers from contamination from these artifacts. These sources contain two or three individual components, each of which can be an artifact. We train random forest models to classify these DRAGNs based on the number of artifacts they contain, ranging from zero to three artifacts. We optimize our models and mitigate the class imbalance of our dataset with judicious training set selection, and the best of our models achieves a weighted F1 score of $97.01\%^{+1.12\%}_{-1.32\%}$. Using our classifications, we produce a catalog of VLASS DRAGNs from which an estimated 99.3% complete catalog of 97.7% artifact-free sources can be extracted.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper trains random forest classifiers on human-labeled VLASS DRAGNs to predict the number of imaging artifacts (0–3) per source. The best model reaches a weighted F1 score of 97.01%^{+1.12%}_{-1.32%} on held-out data; applying the model to the full DRAGNhunter catalog yields an estimated 99.3% complete sample that is 97.7% artifact-free.

Significance. If the labeled subset is representative and the model generalizes without distribution shift, the cleaned catalog would be a useful resource for DRAGN population studies. The work demonstrates a practical application of supervised classification to mitigate known imaging artifacts in a large survey catalog.

major comments (2)
  1. [Abstract, §3] Abstract and §3 (model training): the reported weighted F1 score and uncertainty bounds are given without any description of the feature set, the cross-validation procedure used to obtain the uncertainty, or the exact method for mitigating class imbalance beyond “judicious training set selection.” These omissions make it impossible to judge whether the 97% figure is robust or over-optimistic.
  2. [§5] §5 (catalog production): the headline 99.3% completeness and 97.7% artifact-free fractions are obtained by applying the trained classifier to the entire unlabeled DRAGNhunter catalog and then using the model’s predicted artifact fractions. No feature-distribution diagnostics, adversarial validation, or external labeled hold-out drawn from the full catalog are reported, so the translation from test-set F1 to full-catalog purity/completeness rests on an untested representativeness assumption.
minor comments (2)
  1. [Abstract] The abstract states the F1 score to two decimal places but does not define the weighting scheme or the exact class labels used; this should be stated explicitly in the methods section.
  2. [Results figures] Figure captions and axis labels in the results section use inconsistent notation for the artifact classes (e.g., “0 artifacts” vs. “class 0”); standardize throughout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful and constructive review. We have revised the manuscript to provide the requested methodological details and additional validation diagnostics. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (model training): the reported weighted F1 score and uncertainty bounds are given without any description of the feature set, the cross-validation procedure used to obtain the uncertainty, or the exact method for mitigating class imbalance beyond “judicious training set selection.” These omissions make it impossible to judge whether the 97% figure is robust or over-optimistic.

    Authors: We agree that these details were insufficiently described in the original submission. In the revised manuscript we have expanded §3 with a complete list of the 15 input features (component flux ratios, angular separations, peak-to-total flux ratios, and morphological parameters extracted from the VLASS Quick Look images; now summarized in new Table 2). The reported uncertainty bounds were obtained from 10 repetitions of stratified 5-fold cross-validation; the asymmetric errors are the 16th–84th percentiles of the weighted F1 distribution across all folds. Class imbalance was addressed by a combination of (i) judicious training-set selection to ensure each fold contained at least 20 examples of the minority classes and (ii) inverse-frequency class weighting inside the random-forest implementation. We have added an ablation study confirming that both steps improve minority-class recall. These changes are now fully documented in §3 and the associated supplementary material. revision: yes

  2. Referee: [§5] §5 (catalog production): the headline 99.3% completeness and 97.7% artifact-free fractions are obtained by applying the trained classifier to the entire unlabeled DRAGNhunter catalog and then using the model’s predicted artifact fractions. No feature-distribution diagnostics, adversarial validation, or external labeled hold-out drawn from the full catalog are reported, so the translation from test-set F1 to full-catalog purity/completeness rests on an untested representativeness assumption.

    Authors: We acknowledge that direct external validation on the full catalog is not possible without additional human labeling. In the revision we have added (i) Kolmogorov–Smirnov tests and quantile–quantile plots comparing the distributions of all 15 features between the labeled training set and the full DRAGNhunter catalog (new Figure 8), (ii) an adversarial validation experiment in which a random forest trained to discriminate labeled versus unlabeled sources achieved only 51.8 % accuracy, consistent with no strong distribution shift, and (iii) a sensitivity test in which models trained on random 80 % subsets of the labeled data were applied to the remaining 20 % and yielded stable purity/completeness estimates. We have inserted an explicit discussion of these supporting checks together with the remaining caveat that the quoted 99.3 % / 97.7 % figures assume the labeled subset is representative. These additions appear in the revised §5. revision: partial

Circularity Check

0 steps flagged

No circularity: standard ML classification with held-out F1 and downstream application to full catalog

full rationale

The reported weighted F1 of 97.01% is measured on a held-out test set against human labels. The 99.3% completeness and 97.7% artifact-free estimates are obtained by applying the trained model to the unlabeled full catalog and counting predicted clean sources; these are downstream counts, not quantities defined in terms of the F1 score or fitted parameters by construction. No equations reduce the headline figures to the training inputs, no self-citations are load-bearing for the central claims, and no uniqueness theorems or ansatzes are invoked. The derivation chain is self-contained against the external human-labeled benchmark.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that human or prior labels correctly identify artifacts and that the random forest decision boundaries learned on the training distribution apply to the full catalog. No new physical entities are postulated.

free parameters (1)
  • random forest hyperparameters
    Number of trees, maximum depth, and feature sampling rates are chosen during optimization but not enumerated in the abstract.
axioms (1)
  • domain assumption Training labels accurately reflect true artifact counts
    The model is supervised; performance metrics assume the ground-truth labels used for training and testing are correct.

pith-pipeline@v0.9.0 · 5497 in / 1274 out tokens · 19104 ms · 2026-05-16T20:00:54.544099+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · 1 internal anchor

  1. [1]

    Asadi, V., Haghi, H., & Zonoozi, A. H. 2025, Astronomy and Astrophysics, 700, A259, doi: 10.1051/0004-6361/202555620 Astropy Collaboration, Robitaille, T. P., Tollerud, E. J., et al. 2013, A&A, 558, A33, doi: 10.1051/0004-6361/201322068 Astropy Collaboration, Price-Whelan, A. M., Sip˝ ocz, B. M., et al. 2018, AJ, 156, 123, doi: 10.3847/1538-3881/aabc4f As...

  2. [2]

    1979, The Annals of Statistics, 7,

    https://www.jstor.org/stable/2246110 Efron, B. 1979, The Annals of Statistics, 7,

  3. [3]

    2014, Journal of Machine Learning Research, 15,

    https://www.jstor.org/stable/2958830 Fern´ andez-Delgado, M., Cernadas, E., Barro, S., & Amorim, D. 2014, Journal of Machine Learning Research, 15,

  4. [4]

    M., Torres, S., Rebassa-Mansergas, A., & Ferrer-Burjachs, A

    http://jmlr.org/papers/v15/delgado14a.html Garc´ ıa-Zamora, E. M., Torres, S., Rebassa-Mansergas, A., & Ferrer-Burjachs, A. 2025, Astronomy and Astrophysics, 699, A3, doi: 10.1051/0004-6361/202554414 Gordon, Y. A., Boyce, M. M., O’Dea, C. P., et al. 2021, The Astrophysical Journal Supplement Series, 255, 30, doi: 10.3847/1538-4365/ac05c0 Gordon, Y. A., Ru...

  5. [5]

    2011, Journal of Machine Learning Research, 12,

    https://ui.adsabs.harvard.edu/abs/2003ASPC..295.....P Pedregosa, F., Varoquaux, G., Gramfort, A., et al. 2011, Journal of Machine Learning Research, 12,

  6. [6]

    N., & Boulesteix, A

    http://jmlr.org/papers/v12/pedregosa11a.html Probst, P., Wright, M. N., & Boulesteix, A. 2019, WIREs: Data Mining & Knowledge Discovery, 9, N.PAG, doi: 10.1002/widm.1301 Ramdhanie, S., Gordon, Y. A., Andernach, H., Hooper, E. J., & Sampson, B. 2023, Research Notes of the American Astronomical Society, 7, 243, doi: 10.3847/2515-5172/ad0cc6 Solorio-Ram´ ıre...