A Comparative Evaluation of Sample Selection Algorithms for Multivariate Calibration in Near-Infrared Spectroscopic Analysis of Pharmaceutical Formulations

Aminata Sow; Fadaba Danioko; Harouna Sangaré; Tidiane Diallo


Building robust multivariate calibration models for near-infrared (NIR) spectroscopic analysis requires careful partitioning of samples into training and validation sets, because the selection strategy directly affects model generalizability and predictive accuracy. This study presents a systematic comparison of four established sample selection algorithms (Duplex, Honigs, Kennard-Stone, and Naes) applied to NIR spectral data acquired from 58 commercial paracetamol tablets. Gaussian process regression (GPR) served as the modeling framework, and model performance was quantified by the coefficient of determination (\(R^2\)) and the root mean square error of prediction (RMSEP). The Kennard-Stone algorithm with the Mahalanobis distance metric performed best, yielding the best validation statistics (\(R^2 = 0.99999\), RMSEP = \(1.74 \times 10^{-6}\)). Non-parametric statistical analysis using the Kruskal-Wallis test and post-hoc Mann-Whitney U tests with Bonferroni correction confirmed significant performance differences among the algorithms (\(p < 0.001\)), while indicating statistical equivalence between the Kennard-Stone and Honigs methods. A systematic investigation of training set proportions (60-90%) showed a monotonic relationship between calibration set size and predictive accuracy. These findings provide evidence-based guidance for optimizing sample selection protocols in pharmaceutical NIR applications and underscore the importance of chemometric validation in spectroscopic method development.
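
To make the best-performing selection strategy concrete, the following is a minimal sketch of Kennard-Stone selection with an optional Mahalanobis metric. It is not the authors' implementation: the function name `kennard_stone`, its parameters, and the use of a pseudo-inverse covariance (chosen here because NIR spectra often have more wavelengths than samples) are assumptions made for illustration only.

```python
import numpy as np


def kennard_stone(X, n_select, metric="euclidean"):
    """Illustrative Kennard-Stone split (hypothetical helper, not from the paper).

    X        : (n_samples, n_features) array, e.g. spectra or their PCA scores
    n_select : number of samples to place in the calibration (training) set
    metric   : "euclidean" or "mahalanobis"
    Returns (selected_indices, remaining_indices).
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")

    if metric == "mahalanobis":
        # Assumption: pseudo-inverse of the sample covariance, so the sketch
        # still runs when the covariance matrix is singular.
        VI = np.linalg.pinv(np.atleast_2d(np.cov(X, rowvar=False)))
        G = X @ VI @ X.T
    elif metric == "euclidean":
        G = X @ X.T
    else:
        raise ValueError("metric must be 'euclidean' or 'mahalanobis'")

    # Squared pairwise distances: d2[i, j] = g[i] + g[j] - 2 * G[i, j]
    g = np.diag(G)
    d2 = np.maximum(g[:, None] + g[None, :] - 2.0 * G, 0.0)

    # Step 1: seed the set with the two mutually most distant samples.
    i, j = np.unravel_index(np.argmax(d2), d2.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(n) if k not in selected]

    # Step 2: repeatedly add the sample whose distance to its nearest
    # already-selected sample is largest (maximin criterion).
    while len(selected) < n_select:
        nearest = d2[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)

    return selected, remaining
```

For example, with 58 samples, `kennard_stone(X, 46, metric="mahalanobis")` would assign roughly 80% of the samples to calibration, a proportion inside the 60-90% range examined in the study; the remaining indices form the validation set.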

