Farthest Point Sampling in Property Designated Chemical Feature Space as a General Strategy for Enhancing the Machine Learning Model Performance for Small Scale Chemical Dataset

Xi Yu; Yuze Liu

arxiv: 2404.11348 · v1 · pith:SPNTVUW7new · submitted 2024-04-17 · ⚛️ physics.chem-ph · physics.data-an

Pith reviewed 2026-05-24 02:02 UTC · model grok-4.3

keywords farthest point sampling · machine learning · chemical datasets · small scale datasets · overfitting reduction · chemical feature space · predictive accuracy · physicochemical properties

The pith

Farthest point sampling in a property-designated chemical feature space improves machine learning model performance on small chemical datasets over random sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests farthest point sampling as a strategy for choosing training points for machine learning models when working with small, unbalanced chemical datasets. It shows that models built on these selected sets achieve higher predictive accuracy, greater robustness, and less overfitting than models built on randomly chosen sets. The gains appear most clearly when the training set is small, because the method increases the spread of the data across the feature space tied to the target property. This matters for chemistry and materials science, where experimental labels are often scarce and models trained on limited data tend to memorize rather than generalize.

Core claim

Farthest point sampling within property-designated chemical feature spaces generates well-distributed training datasets that enable machine learning models, including artificial neural networks, support vector machines, and random forests, to reach superior predictive accuracy, robustness, and reduced overfitting on small-scale datasets for properties such as standard boiling points and enthalpy of vaporization.

What carries the argument

Farthest point sampling (FPS) applied inside a property-designated chemical feature space, which repeatedly selects the point farthest from all already chosen points to maximize coverage and diversity.
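
The selection rule just described can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function name `farthest_point_sampling`, the Euclidean metric, the caller-supplied seed index `start`, and the toy feature vectors are all assumptions made here.

```python
import math

def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: return k distinct indices into `points`.

    Assumptions (not taken from the paper): Euclidean distance, and the
    first point is fixed by the caller through `start`.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    remaining = set(range(len(points))) - {start}
    # min_dist[i] = distance from point i to its nearest selected point
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        # Next pick: the unselected point whose nearest selected point is farthest away.
        nxt = max(sorted(remaining), key=lambda i: min_dist[i])
        selected.append(nxt)
        remaining.remove(nxt)
        # Refresh nearest-selected distances with the newly added point.
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected

# Toy 2-D "feature space" (hypothetical): a tight cluster plus two outlying points.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (10.0, 0.0)]
print(farthest_point_sampling(features, 3))  # [0, 4, 3]
```

With three picks the two outlying points are chosen ahead of the remaining cluster members, which is the coverage behaviour the text attributes to FPS. Keeping a running nearest-selected distance per point avoids recomputing all pairwise distances at every step.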

If this is right

FPS-selected training sets produce models with higher predictive accuracy than random selection across the tested algorithms.
The reduction in overfitting is largest when the overall dataset size is small.
The same sampling approach improves performance for multiple physicochemical properties and multiple model types.
Diversity in the chemical feature space of the training data is the direct cause of the observed gains in generalization.
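
The overfitting claim in the list above is quantified in the figure captions as the gap between test and training error. Written out, with N the number of samples in a split and ŷ the model prediction (the symbols are notation chosen here and may differ from the authors'):

```latex
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2,
\qquad
\Delta\mathrm{MSE} = \mathrm{MSE}_{\text{test}} - \mathrm{MSE}_{\text{train}}
```

A smaller ΔMSE means the model performs on unseen data nearly as well as on the data it was trained on.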

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the chosen features fail to track the property of interest, FPS may select points that are distant yet irrelevant.
The method could be combined with active learning loops that update the feature space as new labels arrive.
Similar distance-based selection might help in other experimental sciences that rely on small labeled sets.

Load-bearing premise

The chemical feature space used for distance calculations must be chosen so that the distance between two molecules in that space tracks how much they differ in ways that matter for the target property.
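
One way to probe this premise, along the lines the referee report suggests, is to correlate pairwise feature-space distances with pairwise differences in the target property. The sketch below is illustrative only: the helper names, the Euclidean metric, the use of absolute property differences, and the toy data are assumptions made here, not the paper's procedure.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)

def distance_property_correlation(features, targets):
    """Pearson r between pairwise feature distances and |target differences|."""
    dists, gaps = [], []
    for i, j in combinations(range(len(features)), 2):
        dists.append(math.dist(features[i], features[j]))  # Euclidean (assumed)
        gaps.append(abs(targets[i] - targets[j]))
    return pearson(dists, gaps)

# Hypothetical toy data: one descriptor, target exactly linear in it,
# so every property gap is proportional to the feature distance.
features = [(1.0,), (2.0,), (4.0,), (7.0,)]
targets = [10.0, 20.0, 40.0, 70.0]
r = distance_property_correlation(features, targets)
print(round(r, 6))  # close to 1.0 for this toy data
```

A value of r near zero on real descriptors would suggest that spreading points out in that space does little to diversify the training set with respect to the property.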

What would settle it

A direct comparison on a new small chemical dataset where models trained on FPS-selected points show equal or lower accuracy and higher overfitting than models trained on randomly selected points of the same size.

Figures

Figures reproduced from arXiv: 2404.11348 by Xi Yu, Yuze Liu.

**Figure 2.** (a) MSE of training and test sets of the ANN model under the Farthest Point Sampling method (yellow) and Random Sampling (blue) at different training sizes on the boiling point dataset. (b) The difference between Test MSE and Train MSE (ΔMSE), along with the errors associated with FPS and RS at various training sizes was analyzed. A training size of 0.6, which showed the lowest ΔMSE, indicates the least ov…

**Figure 3.** (a) MSE of training and test sets by random sampling (red), FPS in the interpretable space (blue), regression space (yellow), and casually selected space (green). (b) The ΔMSE along with the errors under different sampling methods at various training sizes was analysed. (c) Training and Test loss curves with different sampling methods during the training process at a training size of 0.3. The test loss cu…

**Figure 5.** Comparison of MSE for training and test sets in the ANN model using FPS (yellow) and RS (blue) across various training sizes on physicochemical datasets. These datasets include the enthalpy of vaporization (HVAP), critical temperature (CT), critical volume (CV), and critical pressure (CP). 0.8. In these heatmaps, the upper and lower triangles denote the test and training sets, respectively. A bluer color i…

**Figure 6.** SNE Plot depicting the distribution of boiling point data points. Probability density illustrates the original data distribution, while grey points represent samples obtained through FPS and RS methods. Points sampled by FPS tend to occur more frequently in areas that are difficult to cluster, with fewer appearing in zones of well-defined clustering. Moreover, we have expanded our study to encompass a va…

Original abstract

Machine learning model development in chemistry and materials science often grapples with the challenge of small scale, unbalanced labelled datasets, a common limitation in scientific experiments. These dataset imbalances can precipitate overfit ting and diminish model generalization. Our study explores the efficacy of the farthest point sampling (FPS) strategy within target ed chemical feature spaces, demonstrating its capacity to generate well-distributed training datasets and consequently enhance model performance. We rigorously evaluated this strategy across various machine learning models, including artificial neural net works (ANN), support vector machines (SVM), and random forests (RF), using datasets encapsulating physicochemical properties like standard boiling points and enthalpy of vaporization. Our findings reveal that FPS-based models consistently surpass those trained via random sampling, exhibiting superior predictive accuracy and robustness, alongside a marked reduction in overfitting. This improvement is particularly pronounced in smaller training datasets, attributable to increased diversity within the training data's chemical feature space. Consequently, FPS emerges as a universally effective and adaptable approach in approaching high performance machine learning models by small and biased experimental datasets prevalent in chemistry and materials science.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FPS in a property-designated feature space beats random sampling on small chem datasets, but the designation step is unvalidated and no metrics are shown.

Hey, the main thing here is that farthest point sampling inside a property-designated chemical feature space produces training sets that give higher accuracy and less overfitting than random sampling, with the edge clearest on the smallest sets. They test this on boiling point and vaporization enthalpy data using ANN, SVM, and random forest models. The application to experimental chemistry datasets is straightforward and the multi-model check is a plus. It shows the expected benefit from greater diversity when data is limited, which lines up with how these sampling methods usually work. The execution is competent in outline for an applied paper.

The soft spots are real, though. The abstract supplies no numbers, error bars, or statistical tests, so the size of any gain stays unknown. The central move is the 'property designated' space, yet nothing is said about how features are selected or weighted, and there is no ablation or correlation check to confirm distances track the target property. If designation already folds in property info, the reported improvement could trace to that step rather than FPS itself. The stress-test note on this point holds up from the abstract alone.

This is for people who train regression models on small, biased experimental chemistry data and need a practical sampling recipe. A reader facing similar constraints would get some usable ideas if the methods section adds the missing controls and numbers. I would send it to peer review because the claim is testable and the setting is relevant, even if heavy revision on the feature space validation and reporting will be required.

Referee Report

3 major / 3 minor

Summary. The manuscript claims that farthest-point sampling (FPS) performed inside a 'property-designated' chemical feature space produces more diverse training subsets than random sampling, yielding higher predictive accuracy, greater robustness, and reduced overfitting for ANN, SVM, and RF models on small datasets of boiling points and enthalpies of vaporization; the benefit is asserted to be especially pronounced for the smallest training-set sizes.

Significance. If the performance gains are shown to be statistically robust and the designation step is shown not to be the sole source of improvement, the work would supply a concrete, low-cost protocol for mitigating the small-data problem that is ubiquitous in experimental chemistry. The multi-model, multi-property evaluation is a positive feature, but the absence of any quantitative metrics or controls in the abstract leaves the magnitude and reliability of the claimed gains unassessable from the summary alone.

major comments (3)

[Abstract] Abstract: the central empirical claim ('FPS-based models consistently surpass those trained via random sampling, exhibiting superior predictive accuracy and robustness, alongside a marked reduction in overfitting') is stated without any numerical values, error bars, or statistical tests. The full manuscript must supply, at minimum, tables or figures reporting R², MAE, or RMSE for FPS versus random sampling across the tested models and properties, together with the number of independent trials and significance tests.
[Methods / Results] The 'property designated chemical feature space' is load-bearing for the novelty claim, yet no quantitative validation is referenced (e.g., correlation between Euclidean distances in the chosen space and the target property, or an ablation comparing FPS in the designated space versus a generic descriptor space). Without such a check it remains possible that any observed gain arises from the feature-selection step rather than from the FPS procedure itself.
[Results] The manuscript asserts that the improvement is 'particularly pronounced in smaller training datasets' but supplies no explicit scaling study (e.g., performance versus training-set size curves for both sampling methods). A figure or table showing this dependence with error bars is required to substantiate the size-dependent claim.
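
The quantities requested in these comments are straightforward to compute. The sketch below shows R², MAE, and RMSE for one set of predictions, plus a mean and sample standard deviation over repeated trials for error bars. The function names and all numbers are hypothetical and chosen only to make the arithmetic easy to follow; none come from the paper.

```python
import math
import statistics

def regression_metrics(y_true, y_pred):
    """Return R^2, MAE and RMSE for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot                   # undefined if y_true is constant
    return {"r2": r2, "mae": mae, "rmse": rmse}

def summarize(trial_values):
    """Mean and sample standard deviation over independent trials."""
    return statistics.mean(trial_values), statistics.stdev(trial_values)

# Hypothetical values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
m = regression_metrics(y_true, y_pred)
print(m)

# Hypothetical test RMSE from five repeated trials of one sampling method.
mean_rmse, std_rmse = summarize([0.31, 0.35, 0.33, 0.36, 0.30])
print(f"RMSE = {mean_rmse:.3f} ± {std_rmse:.3f}")
```

Reporting the mean and spread for both sampling methods at each training-set size would give the performance-versus-size curves with error bars that the third comment asks for.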

minor comments (3)

[Abstract] Abstract contains typographical errors: 'overfit ting' (should be 'overfitting'), 'net works' (should be 'networks').
[Abstract] The phrase 'target ed' appears with an extraneous space; similar spacing issues should be corrected throughout.
[Methods] No mention is made of the specific molecular descriptors or the procedure used to 'designate' the feature space; this information must be supplied with sufficient detail for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have helped us identify areas to strengthen the manuscript. We provide point-by-point responses to the major comments below.

Referee: [Abstract] Abstract: the central empirical claim ('FPS-based models consistently surpass those trained via random sampling, exhibiting superior predictive accuracy and robustness, alongside a marked reduction in overfitting') is stated without any numerical values, error bars, or statistical tests. The full manuscript must supply, at minimum, tables or figures reporting R², MAE, or RMSE for FPS versus random sampling across the tested models and properties, together with the number of independent trials and significance tests.

Authors: We agree that including quantitative metrics improves clarity. The full manuscript already presents tables and figures with R², MAE, and RMSE values for FPS versus random sampling across ANN, SVM, and RF models for both boiling points and enthalpies of vaporization, along with results from multiple independent trials. We will revise the abstract to incorporate representative numerical values (e.g., average R² improvements and MAE reductions) and reference the number of trials and any statistical tests reported in the results.
revision: yes
Referee: [Methods / Results] The 'property designated chemical feature space' is load-bearing for the novelty claim, yet no quantitative validation is referenced (e.g., correlation between Euclidean distances in the chosen space and the target property, or an ablation comparing FPS in the designated space versus a generic descriptor space). Without such a check it remains possible that any observed gain arises from the feature-selection step rather than from the FPS procedure itself.

Authors: We acknowledge the need for explicit validation of the property-designated space. In the revised manuscript, we will add quantitative checks, including correlations (e.g., Pearson) between Euclidean distances in the designated feature space and differences in the target property values. We will also include an ablation comparing FPS performance in the property-designated space versus a generic descriptor space to isolate the contribution of the designation step from the FPS procedure.
revision: yes
Referee: [Results] The manuscript asserts that the improvement is 'particularly pronounced in smaller training datasets' but supplies no explicit scaling study (e.g., performance versus training-set size curves for both sampling methods). A figure or table showing this dependence with error bars is required to substantiate the size-dependent claim.

Authors: We agree that an explicit scaling study would better substantiate the claim. Although results for multiple training-set sizes are presented, we will add a new figure in the revised manuscript plotting performance metrics (R², MAE) versus training-set size for both FPS and random sampling, with error bars from repeated trials to clearly demonstrate the size-dependent advantage.
revision: yes

Circularity Check

0 steps flagged

Empirical sampling comparison with no derivation chain or self-referential predictions

The manuscript reports an empirical study comparing FPS versus random sampling for training ML models (ANN, SVM, RF) on small chemical datasets for properties such as boiling point and vaporization enthalpy. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains are present that would reduce the reported performance gains to inputs by construction. The central claim rests on standard train/test splits and cross-validation benchmarks external to any author-defined fit, satisfying the criteria for a self-contained result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract introduces no new free parameters, axioms beyond the standard definition of FPS, or invented entities; the approach rests on conventional machine-learning practice and the assumption that a suitable feature space can be designated from domain knowledge.

axioms (1)

standard math: Farthest point sampling selects the point maximally distant from the current selected set in the chosen metric.
This is the algorithmic definition invoked by the title and abstract.
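
Stated formally, with X the full dataset, S_k the set of k points already selected, and d the chosen metric (the notation is introduced here for illustration; Euclidean distance is one common choice of d):

```latex
S_1 = \{x_{i_1}\}, \qquad
x_{k+1} = \operatorname*{arg\,max}_{x \in X \setminus S_k} \; \min_{s \in S_k} d(x, s), \qquad
S_{k+1} = S_k \cup \{x_{k+1}\}
```

Here "distance from the selected set" means the distance to the nearest already selected point, which is what the inner minimum expresses.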
