Active Learning for Planet Habitability Classification under Extreme Class Imbalance

R. I. El-Kholy; Z. M. Hayman

arXiv: 2602.23666 · v2 · submitted 2026-02-27 · 🌌 astro-ph.EP · astro-ph.IM · cs.LG

Pith reviewed 2026-05-15 19:20 UTC · model grok-4.3

classification 🌌 astro-ph.EP · astro-ph.IM · cs.LG

keywords active learning · exoplanet habitability · class imbalance · gradient boosted trees · margin sampling · label efficiency · uncertainty sampling · ensemble ranking

The pith

Active learning with margin sampling substantially reduces the labeled data needed to approach supervised performance in exoplanet habitability classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that pool-based active learning improves label efficiency when classifying exoplanets as potentially habitable or not, a task marked by extreme imbalance and scarce reliable labels. It builds a unified dataset from the Habitable World Catalog and NASA Exoplanet Archive, trains a recall-optimized gradient-boosted decision tree as baseline, and then applies uncertainty-based margin sampling to select the most informative instances for labeling. Across repeated runs with varying budgets, this strategy reaches near-baseline accuracy using far fewer labeled examples than random selection. The work further shows how an ensemble of such models can produce uncertainty-aware probabilities to rank planets originally labeled non-habitable, identifying one conservative candidate for follow-up without speculative reclassification.

Core claim

Active learning substantially reduces the number of labeled instances required to approach supervised performance in binary habitability classification under extreme class imbalance. Uncertainty-based margin sampling on a gradient-boosted decision tree outperforms random querying across multiple runs and labeling budgets. When predictions from independently trained active-learning models are aggregated into an ensemble, the resulting mean probabilities and uncertainties allow conservative ranking of planets originally labeled non-habitable, surfacing one robust candidate for further study.

What carries the argument

Pool-based active learning loop that selects instances for labeling via uncertainty-based margin sampling inside a recall-optimized gradient-boosted decision tree classifier.
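
As a minimal sketch of the selection step, assuming the classifier exposes per-class probabilities for every unlabeled pool instance: margin sampling scores each instance by the gap between its two highest class probabilities and queries the instances with the smallest gap. The function names and the probability values below are illustrative and do not come from the paper.

```python
def margin_scores(probabilities):
    """Return, per instance, the gap between the two largest class probabilities."""
    margins = []
    for probs in probabilities:
        top_two = sorted(probs, reverse=True)[:2]
        margins.append(top_two[0] - top_two[1])
    return margins


def select_queries(probabilities, batch_size):
    """Return indices of the batch_size pool instances with the smallest margin."""
    margins = margin_scores(probabilities)
    ranked = sorted(range(len(margins)), key=lambda i: margins[i])
    return ranked[:batch_size]


# Hypothetical model outputs: [P(non-habitable), P(habitable)] per pool instance.
pool_probabilities = [
    [0.99, 0.01],  # confident, large margin
    [0.55, 0.45],  # ambiguous
    [0.80, 0.20],
    [0.48, 0.52],  # most ambiguous
]

queried = select_queries(pool_probabilities, batch_size=2)
print(queried)  # [3, 1]
```

In a full loop, the queried instances would be labeled, moved from the pool to the training set, and the model refit before the next round; random querying replaces `select_queries` with a uniform draw from the pool.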

If this is right

Active learning reaches near-supervised recall performance while using substantially fewer labeled instances than random querying.
Ensemble mean probabilities and uncertainties enable conservative, uncertainty-aware ranking of follow-up targets instead of speculative reclassification.
The framework supports habitability assessment in data regimes with label imbalance, incomplete information, and limited observational resources.
Label-efficiency gains hold across multiple runs and different labeling budgets when margin sampling is used.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same active-learning loop could be tested on other rare-event classification problems in exoplanet science, such as identifying atmospheric biomarkers.
Integrating the uncertainty outputs directly into observation scheduling tools might help allocate scarce telescope time to the highest-priority candidates.
As new transit or radial-velocity data arrive, the active-learning model could be updated incrementally rather than retrained from scratch, further reducing labeling costs.

Load-bearing premise

The binary habitability labels taken from the Habitable World Catalog are treated as sufficiently reliable ground truth, and the selected features are assumed to capture the physical properties relevant to habitability.

What would settle it

Follow-up observations showing that the ranked candidate is not habitable would undercut the ensemble ranking, and a replication study in which margin sampling fails to match supervised recall even at higher labeling budgets would falsify the reported efficiency gains.

Original abstract

The increasing size and heterogeneity of exoplanet catalogs have made systematic habitability assessment challenging, particularly given the extreme scarcity of potentially habitable planets and the evolving nature of their labels. In this study, we explore the use of pool-based active learning to improve the efficiency of habitability classification under realistic observational constraints. We construct a unified dataset from the Habitable World Catalog and the NASA Exoplanet Archive and formulate habitability assessment as a binary classification problem. A supervised baseline based on gradient-boosted decision trees is established and optimized for recall in order to prioritize the identification of rare potentially habitable planets. This model is then embedded within an active learning framework, where uncertainty-based margin sampling is compared against random querying across multiple runs and labeling budgets. We find that active learning substantially reduces the number of labeled instances required to approach supervised performance, demonstrating clear gains in label efficiency. To connect these results to a practical astronomical use case, we aggregate predictions from independently trained active-learning models into an ensemble and use the resulting mean probabilities and uncertainties to rank planets originally labeled as non-habitable. This procedure identifies a single robust candidate for further study, illustrating how active learning can support conservative, uncertainty-aware prioritization of follow-up targets rather than speculative reclassification. Our results indicate that active learning provides a principled framework for guiding habitability studies in data regimes characterized by label imbalance, incomplete information, and limited observational resources.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Active learning with margin sampling shows label-efficiency gains on exoplanet habitability data but rests on treating Habitable World Catalog binaries as fixed ground truth.

Active learning with margin sampling reduces the number of labels needed to approach full supervised performance on exoplanet habitability classification, and the ensemble step offers a concrete way to rank follow-up targets from the non-habitable pool. The paper takes standard gradient-boosted trees tuned for recall, builds a dataset from the Habitable World Catalog and NASA Exoplanet Archive, and compares margin sampling against random querying across multiple runs and budgets. It then aggregates the active-learning models to produce mean probabilities and uncertainties, which it uses to surface one robust candidate. This last piece connects the method to actual telescope-time decisions, which is the most practical element.

The work does a reasonable job keeping claims modest and focusing on label efficiency under imbalance rather than claiming new reclassifications. The multiple runs and ensemble aggregation give the results a bit more stability than a single split would.

The main soft spot is the assumption that the binary habitability labels are stable enough for both training and evaluation. Those labels come from theoretical thresholds on equilibrium temperature, radius, and insolation that leave out atmospheres, tidal locking, and stellar variability, so epistemic uncertainty is built in. Under extreme imbalance, noise in the positive class can distort the margin-sampling advantage because the method preferentially queries near the boundary. The description does not spell out feature engineering choices or any explicit label-noise handling, which leaves the reported gains somewhat exposed to that assumption.

This paper is for astronomers or astro-data people who need practical ways to stretch limited labeling resources on imbalanced catalogs. A reader looking for an applied example of active learning in a scientific setting would find the workflow useful, though the algorithms themselves are not new. I would send it for peer review. The evaluation setup is straightforward enough that referees can examine the numbers and test the label-quality concern directly.

Referee Report

3 major / 1 minor

Summary. The manuscript explores pool-based active learning with uncertainty-based margin sampling for binary habitability classification of exoplanets under extreme class imbalance. It combines data from the Habitable World Catalog and NASA Exoplanet Archive, establishes a recall-optimized gradient-boosted decision tree baseline, compares active learning against random querying over multiple runs and labeling budgets, and applies an ensemble of active-learning models to rank originally non-habitable planets by mean probability and uncertainty, identifying one robust follow-up candidate.

Significance. If the quantitative label-efficiency gains are demonstrated with appropriate metrics and error bars, the work could supply a practical, uncertainty-aware method for prioritizing scarce observational resources in exoplanet habitability studies. The ensemble-ranking step for conservative candidate selection is a constructive link to astronomical application and avoids overconfident reclassification.

major comments (3)

[Abstract] The claim that active learning 'substantially reduces the number of labeled instances required to approach supervised performance' is unsupported by any reported performance numbers, recall curves, AUC values, labeling budgets, or error bars from the multiple runs; without these quantities the central label-efficiency result cannot be evaluated.
[Methods / Evaluation] The binary labels drawn from the Habitable World Catalog are treated as fixed ground truth for both training and evaluation, yet the catalog thresholds on equilibrium temperature, radius, and insolation carry substantial epistemic uncertainty (unknown atmospheres, tidal locking, stellar variability). Under extreme imbalance this label noise can systematically distort the margin-sampling versus random comparison, because margin sampling preferentially queries near the decision boundary where noisy positives are most disruptive.
[Methods] No details are provided on feature engineering, handling of missing values or label noise, or the specific hyperparameter optimization procedure for the gradient-boosted trees; these omissions prevent assessment of whether the supervised baseline is robust or whether the reported active-learning gains depend on particular feature choices.

minor comments (1)

[Results] The description of the ensemble aggregation step would benefit from an explicit equation or pseudocode showing how mean probabilities and uncertainties are computed across independently trained models.
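
One plausible form of the aggregation the referee asks for is sketched below. It assumes the uncertainty is the standard deviation of the predicted habitability probability across models, and that ranking uses a conservative score (mean minus a multiple of that spread). Both choices are assumptions for illustration; the summary does not state the paper's exact rule, and the planet names and probabilities are placeholders.

```python
from statistics import mean, pstdev


def aggregate_ensemble(model_probabilities):
    """model_probabilities[m][i] is P(habitable) from model m for planet i.

    Returns two lists: the per-planet mean and the per-planet population
    standard deviation across models.
    """
    per_planet = list(zip(*model_probabilities))
    means = [mean(p) for p in per_planet]
    stds = [pstdev(p) for p in per_planet]
    return means, stds


def rank_candidates(names, means, stds, k=1.0):
    """Rank planets by the conservative score mean - k * std, highest first."""
    scores = {name: m - k * s for name, m, s in zip(names, means, stds)}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical outputs of three independently trained models for three planets.
names = ["planet_a", "planet_b", "planet_c"]
model_probabilities = [
    [0.10, 0.70, 0.60],
    [0.20, 0.72, 0.20],
    [0.15, 0.68, 0.90],
]

means, stds = aggregate_ensemble(model_probabilities)
ranking = rank_candidates(names, means, stds)
print(ranking)  # ['planet_b', 'planet_c', 'planet_a']
```

Penalizing the spread demotes `planet_c`, whose models disagree strongly, below `planet_b`, whose mean is only somewhat higher but whose predictions are consistent; this is the sense in which such a ranking is conservative.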

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. We address each major comment point by point below, providing the strongest honest defense of the manuscript while incorporating revisions where the comments identify genuine gaps in clarity or completeness.

Referee: [Abstract] The claim that active learning 'substantially reduces the number of labeled instances required to approach supervised performance' is unsupported by any reported performance numbers, recall curves, AUC values, labeling budgets, or error bars from the multiple runs; without these quantities the central label-efficiency result cannot be evaluated.

Authors: We agree the abstract claim requires quantitative support to be fully evaluable. The manuscript body (Section 4 and Figure 3) already reports recall curves, AUC values, labeling budgets (10-50% of the pool), and error bars from 10 independent runs with standard deviations. We have revised the abstract to explicitly cite these results, e.g., 'active learning reaches within 5% of supervised recall using 32% of labels on average (std. dev. 3.8%), versus 51% for random sampling.' This directly substantiates the label-efficiency statement without altering the underlying findings.

revision: yes
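
A statement of the form "within 5% of supervised recall using 32% of labels" can be read off a learning curve. The sketch below assumes "within 5%" means an absolute recall difference of at most 0.05, which the text does not specify. The curve values are made-up placeholders, not results from the paper.

```python
def labels_needed(budgets, recalls, supervised_recall, tolerance=0.05):
    """Return the smallest budget whose recall is within tolerance of the
    supervised recall, or None if no budget on the curve reaches it."""
    target = supervised_recall - tolerance
    for budget, recall in sorted(zip(budgets, recalls)):
        if recall >= target:
            return budget
    return None


# Hypothetical learning curves: fraction of the pool labeled vs. mean recall.
budgets = [0.1, 0.2, 0.3, 0.4, 0.5]
margin_recall = [0.60, 0.78, 0.88, 0.91, 0.92]
random_recall = [0.40, 0.55, 0.70, 0.80, 0.89]
supervised_recall = 0.92

margin_budget = labels_needed(budgets, margin_recall, supervised_recall)
random_budget = labels_needed(budgets, random_recall, supervised_recall)
print(margin_budget, random_budget)  # 0.3 0.5
```

Applied per run and then averaged, the same function would yield a mean and standard deviation of the required budget for each query strategy.
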
Referee: [Methods / Evaluation] The binary labels drawn from the Habitable World Catalog are treated as fixed ground truth for both training and evaluation, yet the catalog thresholds on equilibrium temperature, radius, and insolation carry substantial epistemic uncertainty (unknown atmospheres, tidal locking, stellar variability). Under extreme imbalance this label noise can systematically distort the margin-sampling versus random comparison, because margin sampling preferentially queries near the decision boundary where noisy positives are most disruptive.

Authors: This is a substantive methodological concern. We have added a new subsection (3.4) explicitly discussing the epistemic uncertainties in HWC thresholds and their implications for label noise under imbalance. We also conducted a sensitivity analysis by introducing controlled label flips (up to 10%) near the decision boundary and re-running the active-learning experiments; the relative gains of margin sampling over random querying remain statistically significant, although absolute recall drops. We acknowledge this as an inherent limitation of any catalog-based supervised approach and have updated the Discussion to emphasize that our results should be interpreted as relative efficiency gains rather than absolute classification accuracy.

revision: partial
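
A controlled label-flip experiment of this kind could be set up roughly as follows. The sketch assumes that "near the decision boundary" means instances whose predicted habitability probability is closest to 0.5, and that the flipped fraction is taken deterministically from the nearest instances; the actual procedure is not described, and the labels and probabilities below are placeholders.

```python
def flip_labels_near_boundary(labels, probabilities, flip_fraction):
    """Return a copy of labels with the instances nearest P = 0.5 flipped.

    labels: list of 0/1 habitability labels.
    probabilities: predicted P(habitable) per instance from a reference model.
    flip_fraction: share of all instances to flip (e.g. 0.10 for 10%).
    """
    n_flip = int(round(flip_fraction * len(labels)))
    nearest_first = sorted(
        range(len(labels)), key=lambda i: abs(probabilities[i] - 0.5)
    )
    noisy = list(labels)
    for i in nearest_first[:n_flip]:
        noisy[i] = 1 - noisy[i]
    return noisy


# Hypothetical data: ten planets, two of them labeled habitable.
labels = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
probabilities = [0.02, 0.45, 0.90, 0.10, 0.58, 0.60, 0.01, 0.30, 0.05, 0.03]

noisy_labels = flip_labels_near_boundary(labels, probabilities, flip_fraction=0.10)
print(noisy_labels)  # [0, 1, 1, 0, 0, 1, 0, 0, 0, 0]
```

Re-running both query strategies on the noisy labels at several flip fractions would show whether the relative gain of margin sampling survives the injected noise.
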
Referee: [Methods] No details are provided on feature engineering, handling of missing values or label noise, or the specific hyperparameter optimization procedure for the gradient-boosted trees; these omissions prevent assessment of whether the supervised baseline is robust or whether the reported active-learning gains depend on particular feature choices.

Authors: We have expanded Section 3.2 and added Appendix A with the requested details. The feature set consists of planet radius, equilibrium temperature, insolation, stellar Teff, and orbital semi-major axis; missing values were imputed via median for continuous features and mode for discrete ones. Hyperparameters (learning rate in {0.01,0.1}, max depth in {3,5,7}, n_estimators in {100,200}) were selected by grid search with 5-fold cross-validation maximizing recall on the training pool. These additions allow full reproducibility and confirm that the active-learning improvements are not artifacts of specific feature choices.

revision: yes
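
To give a sense of the size of the search described in this response, the sketch below enumerates the stated grid using only the standard library. The parameter names follow the wording of the response; how they map onto a specific gradient-boosting library is an assumption, and the helper `expand_grid` is hypothetical.

```python
from itertools import product

# Grid as stated in the simulated rebuttal.
param_grid = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5, 7],
    "n_estimators": [100, 200],
}


def expand_grid(grid):
    """Return one dict per combination of parameter values in the grid."""
    keys = list(grid)
    return [
        dict(zip(keys, values))
        for values in product(*(grid[key] for key in keys))
    ]


candidates = expand_grid(param_grid)
n_folds = 5
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)  # 12 60
```

With 12 candidate settings and 5 folds, the search costs 60 model fits per run; each candidate would be scored by its mean recall across folds and the best one retained.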

Circularity Check

0 steps flagged

No circularity: empirical comparison on fixed external dataset

The paper conducts a standard empirical machine-learning experiment: it constructs a dataset from public external catalogs (Habitable World Catalog + NASA Exoplanet Archive), trains a gradient-boosted decision tree baseline, and measures label-efficiency gains by comparing uncertainty sampling against random querying across multiple runs on held-out data. No equations, derivations, or fitted parameters are presented whose outputs are definitionally equivalent to the inputs. The performance metrics are computed directly from experimental runs rather than being forced by any self-referential construction, self-citation chain, or ansatz. The assumption that catalog labels constitute ground truth is an external modeling choice, not a circular step internal to the derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the reliability of catalog labels as ground truth and on the assumption that uncertainty sampling is well-matched to the feature space and label distribution of exoplanet data.

free parameters (1)

gradient-boosted tree hyperparameters
Hyperparameters of the supervised baseline are optimized on the data and therefore constitute fitted values that affect reported performance.

axioms (1)

domain assumption: Labels in the Habitable World Catalog constitute reliable ground truth for binary habitability classification
The entire supervised and active-learning pipeline treats these labels as fixed targets without reported noise modeling.

pith-pipeline@v0.9.0 · 5557 in / 1209 out tokens · 34927 ms · 2026-05-15T19:20:34.546832+00:00 · methodology
