Reframing preprocessing selection as model-internal calibration in near-infrared spectroscopy: A large-scale benchmark of operator-adaptive PLS and Ridge models
Pith reviewed 2026-05-20 21:02 UTC · model grok-4.3
The pith
Linear operator banks can be screened inside PLS and Ridge calibration for near-infrared spectroscopy, matching exhaustive preprocessing accuracy at far lower cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For strict linear preprocessing operators A, the transformed PLS cross-covariance satisfies (X A^T)^T Y = A X^T Y, and Ridge regression depends on the induced kernel X A^T A X^T. These identities allow a finite operator bank to be screened inside one calibration step while retaining original-wavelength coefficients. On the main regression set of 32 datasets, plain compact-bank AOM-PLS records median RMSEP ratios of 0.991 against PLS-default and 0.990 against PLS-HPO; selected branches reach 0.985 and 1.002. Plain AOMRidge-global-compact-none records 0.974 and 0.984 against Ridge baselines, and the selected classifier improves balanced accuracy by 0.159. Median runtime drops from 710.81 s for PLS-HPO to 1.63 s for the selected AOM-PLS branch.
What carries the argument
Operator-induced algebraic identities that absorb a finite bank of linear preprocessing operators into the PLS cross-covariance and the Ridge kernel.
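The two identities can be checked numerically. The sketch below is a minimal illustration and not the paper's code: the data are random, and the first-difference matrix `A` is a hypothetical stand-in for one member of an operator bank. It assumes spectra are stored as rows of X, so applying A to every spectrum gives X A^T.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                        # samples, wavelengths (illustrative sizes)
X = rng.normal(size=(n, p))          # synthetic "spectra", one per row
Y = rng.normal(size=(n, 1))          # synthetic response

# Hypothetical linear operator: first difference along the wavelength axis.
A = np.eye(p) - np.eye(p, k=-1)

# External route: apply the operator to every spectrum first.
X_pre = X @ A.T

# PLS cross-covariance identity: (X A^T)^T Y = A X^T Y
cov_identity_holds = np.allclose(X_pre.T @ Y, A @ (X.T @ Y))

# Ridge kernel identity: (X A^T)(X A^T)^T = X A^T A X^T
kernel_identity_holds = np.allclose(X_pre @ X_pre.T, X @ (A.T @ A) @ X.T)

print(cov_identity_holds, kernel_identity_holds)
```

Neither check depends on the particular choice of `A`; any fixed p-by-p matrix would pass, which is why the right-hand sides can be evaluated for a whole bank without forming each preprocessed copy of X.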
If this is right
- On 32 regression datasets, plain compact-bank AOM-PLS yields median RMSEP ratios of 0.991 versus PLS-default and 0.990 versus PLS-HPO.
- Selected ASLS-AOM-compact-cv5 branches reach ratios of 0.985 and 1.002 on the same references.
- Plain AOMRidge-global-compact-none reaches ratios of 0.974 and 0.984 against Ridge baselines; the selected AOMRidge-Blender improves further to 0.918 and 0.966.
- The selected AOM-PLS-DA classifier improves balanced accuracy by 0.159 with 12 wins out of 13 datasets.
- Median total fitting time falls from 710.81 s for PLS-HPO to 1.63 s for the selected AOM-PLS branch.
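The runtime claim reduces to simple arithmetic on the two reported medians. The snippet below is only a sanity check of the "orders of magnitude" wording; note that it takes a ratio of medians, which is not necessarily the same as a median of per-dataset speedups (the summary does not report the latter).

```python
import math

t_hpo = 710.81   # median total time of PLS-HPO in seconds (from the summary)
t_aom = 1.63     # median total time of the selected AOM-PLS branch in seconds

speedup = t_hpo / t_aom              # ratio of the two medians
orders = math.log10(speedup)         # the same ratio on a log10 scale

print(f"speedup: {speedup:.0f}x, log10: {orders:.1f}")
```

The ratio is a little over 400-fold, which sits between two and three orders of magnitude.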
Where Pith is reading between the lines
- The same absorption technique could be tested on other linear models whose fitting depends on cross-covariance or kernels.
- Automation of spectroscopy calibration pipelines becomes more feasible once linear operator choice no longer requires repeated external refits.
- Domain experts could evaluate larger banks of operators without incurring proportional compute cost.
- The approach might be applied to other scientific domains that routinely apply linear feature transformations before regression.
Load-bearing premise
The derivation assumes every candidate preprocessing step is a strict linear operator so the cross-covariance and kernel identities hold exactly.
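What "strict linear operator" rules in and out can be seen with a small empirical test. The sketch below is an illustration under assumptions: `first_derivative` (a finite-difference derivative) stands in for the filter family, `snv` is a plain implementation of standard normal variate, and `is_linear` is a hypothetical helper that checks superposition on random inputs.

```python
import numpy as np

def first_derivative(x):
    """Finite-difference derivative along the wavelength axis."""
    return np.gradient(x, axis=-1)

def snv(x):
    """Standard normal variate: center and scale a spectrum by its own statistics."""
    return (x - x.mean(axis=-1, keepdims=True)) / x.std(axis=-1, keepdims=True)

def is_linear(f, p, rng, trials=5):
    """Empirically test f(a*u + b*v) == a*f(u) + b*f(v) on random inputs."""
    for _ in range(trials):
        u, v = rng.normal(size=p), rng.normal(size=p)
        a, b = rng.normal(size=2)
        if not np.allclose(f(a * u + b * v), a * f(u) + b * f(v)):
            return False
    return True

rng = np.random.default_rng(1)
p = 40
derivative_is_linear = is_linear(first_derivative, p, rng)
snv_is_linear = is_linear(snv, p, rng)

# A linear operator can be written as a matrix by applying it to the identity:
# row j of the result is f(e_j), i.e. column j of the operator matrix.
A_deriv = first_derivative(np.eye(p)).T

print(derivative_is_linear, snv_is_linear)
```

The derivative passes and can be materialised as the matrix `A_deriv`, so it fits the algebra. SNV fails because its centering and scaling depend on the spectrum being transformed, which is consistent with the abstract's statement that SNV, MSC, EMSC and ASLS stay outside the absorbed bank.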
What would settle it
On a new dataset, an AOM-PLS run that includes a non-linear preprocessing variant in its bank produces substantially higher RMSEP than an exhaustive external search over the identical set of steps.
Original abstract
Preprocessing screening is often the most expensive part of a near-infrared spectroscopy calibration workflow. It works because smoothing, derivatives, detrending and related filters change the spectral directions seen by PLS or Ridge regression, but a full external search repeatedly refits nearly the same linear model. This paper studies the case where that search can be collapsed into one calibration step. For strict linear preprocessing operators, the transformed PLS cross-covariance satisfies (X A^T)^T Y = A X^T Y, and Ridge regression depends on the operator-induced kernel X A^T A X^T. These identities allow a finite operator bank to be screened inside the model while retaining original-wavelength coefficients. Sample-adaptive or fitted corrections such as SNV, MSC, EMSC and ASLS remain fold-local branches, not absorbed into the algebra. The study uses the AOM benchmark cohort: 61 regression rows and 17 classification rows in the manifest. On the main regression denominator (N=32), plain compact-bank AOM-PLS records median RMSEP ratios of 0.991 against PLS-default and 0.990 against PLS-HPO; the selected ASLS-AOM-compact-cv5 branch records 0.985 and 1.002 on the same two references. The plain AOMRidge-global-compact-none baseline records 0.974 against Ridge-default and 0.984 against Ridge-HPO, while the selected AOMRidge-Blender-headline-spxy3 records 0.918 and 0.966. The selected classifier, AOM-PLS-DA-global-simpls-covariance, improves balanced accuracy by 0.159 on N=13 datasets with 12/13 wins. The runtime gap is the practical result: PLS-HPO takes a median total time of 710.81 s per run, whereas the selected AOM-PLS branch takes 1.63 s. Linear operator-adaptive calibration therefore gives comparable prediction quality to exhaustive preprocessing screening, with orders-of-magnitude less fitting time for PLS.
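The Ridge statement in the abstract can be made concrete. The sketch below is a minimal illustration under stated assumptions (random data, no intercept or centering, a hypothetical first-difference operator `A`, and an arbitrary penalty `lam`); it is not the AOMRidge implementation. It shows that a dual Ridge solve using only the induced kernel gives the same original-wavelength coefficients as preprocessing first and fitting afterwards.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 15, 60, 0.5
X = rng.normal(size=(n, p))          # synthetic spectra, one per row
y = rng.normal(size=n)               # synthetic response
A = np.eye(p) - np.eye(p, k=-1)      # hypothetical linear operator

# External route: preprocess, then solve primal ridge in the transformed space.
Z = X @ A.T
beta_z = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)
b_external = A.T @ beta_z            # fold A back in: y_hat = x^T (A^T beta_z)

# Internal route: dual ridge using only the induced kernel K = X A^T A X^T.
G = A.T @ A
K = X @ G @ X.T
alpha = np.linalg.solve(K + lam * np.eye(n), y)
b_internal = G @ X.T @ alpha         # coefficients on the original wavelengths

# Predictions for a new raw spectrum agree between the two routes.
x_new = rng.normal(size=p)
pred_external = (A @ x_new) @ beta_z
pred_internal = x_new @ b_internal

print(np.allclose(b_external, b_internal))
```

Because the internal route only needs the p-by-p matrix `G` for each operator, a bank can in principle be screened by swapping `G` while X stays untouched; how the paper selects among candidates is not shown here.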
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that strict linear preprocessing operators can be absorbed into PLS and Ridge regression via algebraic identities on the cross-covariance ((X A^T)^T Y = A X^T Y) and induced kernel (X A^T A X^T), allowing a finite bank of operators to be screened internally during a single calibration rather than via external exhaustive search. Sample-adaptive corrections remain as fold-local branches. On the main regression cohort (N=32), plain compact-bank AOM-PLS yields median RMSEP ratios of 0.991 vs. PLS-default and 0.990 vs. PLS-HPO; selected branches and AOM-Ridge variants are similarly close to 1.0 while reducing median runtime from ~711 s (PLS-HPO) to ~1.6 s. A selected PLS-DA variant improves balanced accuracy by 0.159 on N=13 classification datasets.
Significance. If the algebraic absorption extends exactly to the full iterative PLS procedure, the work offers a practical route to collapse preprocessing screening into model fitting, yielding comparable predictive performance at orders-of-magnitude lower cost. The large-scale benchmark on 32+13 datasets supplies concrete median ratios and runtime numbers; the parameter-free linear-algebra identities and retention of original-wavelength coefficients are additional strengths. The result is most impactful for high-throughput NIR workflows where external HPO is prohibitive.
major comments (1)
- [Methods describing AOM-PLS and the operator absorption] Methods / PLS implementation: The identities hold for the initial cross-covariance, but standard NIPALS/SIMPLS PLS performs successive deflation of X and the current residual Y. It is not shown whether the operator A is reapplied at each deflation step or only to the initial weight vector; if the latter, the multi-component model is not guaranteed to match an external preprocessing-then-PLS pipeline. This equivalence is load-bearing for the central claim that internal screening replicates external screening.
minor comments (2)
- [Abstract and §5] Abstract and results: the manifest contains 61 regression rows yet the main denominator is N=32; state the exact selection rule for the reported cohort and whether it was pre-specified.
- [Results tables] Notation: the compact-bank vs. global-bank distinction and the meaning of suffixes such as 'cv5', 'spxy3', and 'headline' are used in the reported ratios but not defined on first use.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable feedback on our manuscript arXiv:2605.13587. The comment regarding the PLS implementation and the scope of the algebraic identities is well-taken. We have revised the Methods section to provide greater clarity on how the operator absorption is implemented in the context of iterative deflation procedures. Below we respond point-by-point to the major comment.
- Referee: [Methods describing AOM-PLS and the operator absorption] Methods / PLS implementation: The identities hold for the initial cross-covariance, but standard NIPALS/SIMPLS PLS performs successive deflation of X and the current residual Y. It is not shown whether the operator A is reapplied at each deflation step or only to the initial weight vector; if the latter, the multi-component model is not guaranteed to match an external preprocessing-then-PLS pipeline. This equivalence is load-bearing for the central claim that internal screening replicates external screening.
  Authors: We appreciate the referee's observation that the provided algebraic identities apply directly to the initial cross-covariance computation. In the AOM-PLS framework, the operator A is incorporated at the level of the initial covariance matrix to generate candidate weight vectors for the first latent variable. Subsequent deflation steps follow the standard NIPALS or SIMPLS procedure on the deflated matrices without re-applying the preprocessing operator A, as this would require re-transforming the residuals at each iteration, which is not part of the absorption identity. This implementation choice ensures that the final model coefficients remain in the original wavelength space and maintains the computational efficiency. We acknowledge that for multi-component models, this does not guarantee exact numerical equivalence to an external 'preprocess then fit' pipeline for every component. However, our empirical results across the benchmark show that the predictive performance remains comparable, suggesting that the first-component approximation captures the essential benefits of preprocessing screening. To address the concern, we have updated the manuscript with: (1) explicit description of the deflation procedure in the AOM-PLS algorithm, (2) a discussion of the conditions for exact equivalence (limited to single-component or covariance-only methods), and (3) supplementary experiments demonstrating the closeness of internal and external approaches even for multi-component PLS. These revisions clarify the scope of our claims without changing the overall conclusions or runtime advantages reported.
  Revision: yes
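The deflation concern can be illustrated on synthetic data. The sketch below is a hypothetical reading of the exchange, not the AOM-PLS algorithm: `pls1_fitted` is a bare NIPALS-style PLS1 without centering, `first_weight_op` is an invented parameter that applies a matrix to the first weight vector only, and `A` is an arbitrary first-difference operator. With `first_weight_op = A^T A`, the first score matches the external pipeline; later components use plain deflation.

```python
import numpy as np

def pls1_fitted(X, y, n_comp, first_weight_op=None):
    """Training-set fitted values of PLS1 with deflation of X and y.

    first_weight_op, if given, multiplies the first weight vector only
    (an assumed reading of "operator applied at the initial cross-covariance").
    """
    Xk, yk = X.copy(), y.copy()
    for k in range(n_comp):
        w = Xk.T @ yk                            # cross-covariance direction
        if k == 0 and first_weight_op is not None:
            w = first_weight_op @ w
        t = Xk @ w                               # score vector (its scale cancels below)
        tt = t @ t
        Xk = Xk - np.outer(t, Xk.T @ t) / tt     # deflate X
        yk = yk - t * (yk @ t) / tt              # deflate y
    return y - yk                                # fitted values = y minus final residual

rng = np.random.default_rng(3)
n, p = 30, 40
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
A = np.eye(p) - np.eye(p, k=-1)                  # hypothetical linear operator

gap = {}
for n_comp in (1, 2, 3):
    external = pls1_fitted(X @ A.T, y, n_comp)                      # preprocess, then PLS
    internal = pls1_fitted(X, y, n_comp, first_weight_op=A.T @ A)   # A at step one only
    gap[n_comp] = float(np.max(np.abs(external - internal)))
    print(n_comp, gap[n_comp])
```

Under these assumptions the one-component fits agree to rounding error, while the fits with two or more components separate. That matches the rebuttal's concession that exact equivalence is limited to the single-component case; it says nothing about how large the gap is on real spectra.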
Circularity Check
No significant circularity; identities are general linear algebra
The paper's derivation rests on the algebraic identities (X A^T)^T Y = A X^T Y and the induced kernel X A^T A X^T, which follow directly from matrix multiplication properties for any linear operator A and are not derived from or equivalent to the paper's empirical results by construction. Performance is assessed via median RMSEP ratios against independent external baselines (PLS-default, PLS-HPO, Ridge-default, Ridge-HPO) on the AOM cohort rather than any self-referential fit. Sample-adaptive corrections are explicitly excluded from absorption. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked; the skeptic concern about deflation affects exact equivalence (a correctness question) but does not reduce any claimed result to its own inputs. This is the common case of a self-contained empirical benchmark with independent algebraic support.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: For strict linear preprocessing operators, the transformed PLS cross-covariance satisfies (X A^T)^T Y = A X^T Y, and Ridge regression depends on the operator-induced kernel X A^T A X^T.
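For reference, both parts of this axiom follow from the transpose rule (BC)^T = C^T B^T alone. The derivation below assumes the usual shapes, which the summary does not state explicitly: X is n-by-p with one spectrum per row, A is p-by-p, and Y has n rows.

```latex
\begin{aligned}
(X A^{\top})^{\top} Y &= (A^{\top})^{\top} X^{\top} Y = A\, X^{\top} Y, \\
(X A^{\top})(X A^{\top})^{\top} &= X A^{\top} (A^{\top})^{\top} X^{\top} = X A^{\top} A\, X^{\top}.
\end{aligned}
```

Neither line uses any property of A beyond its being a fixed matrix, so the axiom is only as strong as the requirement that each preprocessing step can be written that way.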