Reframing preprocessing selection as model-internal calibration in near-infrared spectroscopy: A large-scale benchmark of operator-adaptive PLS and Ridge models

Camille Noûs; Denis Cornet; Gregory Beurier; Lauriane Rouan; Robin Reiter

arXiv: 2605.13587 · v2 · pith:4AFFH5ZOnew · submitted 2026-05-13 · 📊 stat.ML · cs.LG · eess.SP


Pith reviewed 2026-05-20 21:02 UTC · model grok-4.3

classification: stat.ML · cs.LG · eess.SP

keywords: near-infrared spectroscopy · PLS regression · Ridge regression · preprocessing selection · operator-adaptive calibration · benchmark · model calibration


The pith

Linear operator banks can be screened inside PLS and Ridge calibration for near-infrared spectroscopy, matching exhaustive preprocessing accuracy at far lower cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that exhaustive external screening of preprocessing steps can be replaced by internal screening within a single PLS or Ridge fit when the steps are strict linear operators. Algebraic identities let the model absorb a finite bank of operators while keeping coefficients in the original wavelengths. Sample-adaptive corrections stay outside as separate branches. On 32 regression datasets the operator-adaptive versions reach median RMSEP ratios of roughly 0.99 against both default and hyperparameter-optimized baselines, with similar or better results for Ridge and classification tasks. The decisive practical gain is runtime: median fitting time falls from hundreds of seconds to roughly two seconds.

Core claim

For strict linear preprocessing operators A the transformed PLS cross-covariance satisfies (X A^T)^T Y = A X^T Y and Ridge regression depends on the induced kernel X A^T A X^T. These identities allow a finite operator bank to be screened inside one calibration step while retaining original-wavelength coefficients. On the main regression set of 32 datasets plain compact-bank AOM-PLS records median RMSEP ratios of 0.991 against PLS-default and 0.990 against PLS-HPO; selected branches reach 0.985 and 1.002. Plain AOMRidge-global-compact-none records 0.974 and 0.984 against Ridge baselines, and the selected classifier improves balanced accuracy by 0.159. Median runtime drops from 710.81 s for PLS-HPO to 1.63 s for the selected AOM-PLS branch.
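The cross-covariance identity can be checked numerically in a few lines. The sketch below is illustrative only: the random data, the toy dimensions, and the choice of a first-difference matrix as the operator A are assumptions made here, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 25                      # samples, wavelengths (toy sizes, assumed)
X = rng.normal(size=(n, p))        # stand-in spectra, one row per sample
Y = rng.normal(size=(n, 1))        # stand-in reference values

# Hypothetical strict linear operator: first difference along the wavelength axis.
# Row i of A maps a spectrum x to x[i + 1] - x[i].
A = np.diff(np.eye(p), axis=0)     # shape (p - 1, p)

external = (X @ A.T).T @ Y         # transform every spectrum, then cross-covariance
internal = A @ (X.T @ Y)           # cross-covariance once, then apply the operator

identity_holds = np.allclose(external, internal)
print(identity_holds)
```

The external route touches all n transformed spectra, while the internal route applies A to a single p-vector per response column.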

What carries the argument

Operator-induced algebraic identities that absorb a finite bank of linear preprocessing operators into the PLS cross-covariance and the Ridge kernel.
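To make the Ridge half of this concrete, the following sketch fits ridge regression two ways: on explicitly transformed spectra, and through the kernel X A^T A X^T alone. It assumes centred data with no intercept, a hypothetical penalty `lam`, and a first-difference matrix as a stand-in for A; none of these specifics come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 30, 50, 0.5            # toy sizes and penalty (assumed values)
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
A = np.diff(np.eye(p), axis=0)     # hypothetical linear operator, shape (p - 1, p)

# External route: transform the spectra, then solve primal ridge on the new features.
XA = X @ A.T
beta_A = np.linalg.solve(XA.T @ XA + lam * np.eye(XA.shape[1]), XA.T @ y)

# Internal route: the fit only needs the operator-induced kernel K = X A^T A X^T.
K = X @ (A.T @ A) @ X.T
alpha = np.linalg.solve(K + lam * np.eye(n), y)

# Map the dual solution back to coefficients on the original wavelength axis.
beta_orig = A.T @ (A @ (X.T @ alpha))

same_fit = np.allclose(XA @ beta_A, X @ beta_orig)
print(same_fit, beta_orig.shape)
```

Note that `beta_orig` has one entry per original wavelength, even though the external fit lives on the shorter transformed axis.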

If this is right

On 32 regression datasets, plain compact-bank AOM-PLS yields median RMSEP ratios of 0.991 versus PLS-default and 0.990 versus PLS-HPO.
Selected ASLS-AOM-compact-cv5 branches reach ratios of 0.985 and 1.002 on the same references.
Plain AOMRidge-global-compact-none reaches ratios of 0.974 and 0.984 against Ridge baselines; the selected AOMRidge-Blender improves further to 0.918 and 0.966.
The selected AOM-PLS-DA classifier improves balanced accuracy by 0.159 with 12 wins out of 13 datasets.
Median total fitting time falls from 710.81 s for PLS-HPO to 1.63 s for the selected AOM-PLS branch.
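The headline figures above are medians of per-dataset error ratios. A minimal sketch of that summary follows, assuming the ratio is formed per dataset as candidate RMSEP over reference RMSEP before the median is taken (the excerpt does not spell out the aggregation order). The function names and all numbers are made up for illustration.

```python
import math
import statistics

def rmsep(y_true, y_pred):
    """Root mean squared error of prediction on a held-out set."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def median_rmsep_ratio(candidate, reference):
    """Median over datasets of RMSEP(candidate) / RMSEP(reference).

    Values below 1 favour the candidate model.
    """
    return statistics.median(c / r for c, r in zip(candidate, reference))

# Hypothetical per-dataset RMSEP values for a candidate and a reference model.
candidate_rmsep = [0.50, 1.20, 0.33]
reference_rmsep = [0.52, 1.20, 0.30]
ratio = median_rmsep_ratio(candidate_rmsep, reference_rmsep)
print(ratio)
```

A median ratio near 1 can hide individual datasets where one method is clearly better or worse, which is why the source also reports win counts.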

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same absorption technique could be tested on other linear models whose fitting depends on cross-covariance or kernels.
Automation of spectroscopy calibration pipelines becomes more feasible once linear operator choice no longer requires repeated external refits.
Domain experts could evaluate larger banks of operators without incurring proportional compute cost.
The approach might be applied to other scientific domains that routinely apply linear feature transformations before regression.

Load-bearing premise

The derivation assumes every candidate preprocessing step is a strict linear operator so the cross-covariance and kernel identities hold exactly.
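The premise separates filters that can be written as a fixed matrix from corrections that depend on each spectrum. The spot check below tests additivity, a necessary condition for linearity, on a first difference and on SNV. The helper names and random spectra are assumptions for illustration; SNV is written in its usual form (subtract the spectrum mean, divide by the spectrum standard deviation).

```python
import numpy as np

def first_difference(x):
    """Fixed linear filter: equivalent to multiplying by a constant matrix."""
    return np.diff(x)

def snv(x):
    """Standard normal variate: centre and scale each spectrum by its own statistics."""
    return (x - x.mean()) / x.std()

def is_additive(op, x1, x2):
    """Necessary condition for linearity: op(x1 + x2) == op(x1) + op(x2)."""
    return np.allclose(op(x1 + x2), op(x1) + op(x2))

rng = np.random.default_rng(2)
x1, x2 = rng.normal(size=100), rng.normal(size=100)

diff_is_additive = is_additive(first_difference, x1, x2)
snv_is_additive = is_additive(snv, x1, x2)
print(diff_is_additive, snv_is_additive)
```

SNV fails because its scaling factor is computed from the spectrum being corrected, which is consistent with the source keeping such steps outside the absorbed bank.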

What would settle it

On a new dataset, an AOM-PLS run that includes a non-linear preprocessing variant in its bank produces substantially higher RMSEP than an exhaustive external search over the identical set of steps.

Figures

Figures reproduced from arXiv: 2605.13587 by Camille Noûs, Denis Cornet, Gregory Beurier, Lauriane Rouan, Robin Reiter.

**Figure 1.** External preprocessing screening repeatedly fits transformed pipelines. Operator…

**Figure 2.** AOM mathematical structure. AOM-PLS exploits an operator identity on cross…

**Figure 2.** AOM mathematical structure. AOM-PLS screens operators through cross-covariances;…

**Figure 3.** Median error reductions and win counts from the available headline result tables.

**Figure 3.** Dataset diversity in the AOM cohort. Points show total sample count and number of…

**Figure 4.** Search-budget contrast. Counts for PLS, Ridge, CatBoost and CNN-1D come from the…

**Figure 4.** Eight paired regression comparisons for the plain compact-bank AOM baselines…

**Figure 5.** Prediction-quality scatter plots for the selected AOM-PLS and AOM-Ridge variants…

**Figure 6.** R² coverage curves for the eight paper variants on the main regression denominator. Higher curves indicate a larger fraction of datasets exceeding each test-R² threshold.

… re-estimates pipelines on small folds and then promotes the winner of that noisy search. The strict-linear AOM step instead selects operators from a covariance or kernel representation of one calibration problem. It does not remove valida…

**Figure 7.** Search-budget scale for the eight paper variants: PLS-default, PLS-HPO, AOM-PLS…

**Figure 8.** Accuracy/time view for all eight paper variants. The horizontal axis uses median…

**Figure 9.** Runtime distributions for all eight paper variants. PLS-default, PLS-HPO, Ridge…

**Figure 1.** Per-dataset compact-bank operator heatmap. Cells count selected components across…

**Figure 2.** FastAOM variant comparison ordered by median relative RMSEP. The upper axis shows…

**Figure 3.** Per-dataset RMSEP ratios against PLS-standard. Columns are ordered by PLS, AOM…

**Figure 4.** Per-dataset RMSEP reduction for selected AOM variants. Positive bars indicate lower…

Original abstract

Preprocessing screening is often the most expensive part of a near-infrared spectroscopy calibration workflow. It works because smoothing, derivatives, detrending and related filters change the spectral directions seen by PLS or Ridge regression, but a full external search repeatedly refits nearly the same linear model. This paper studies the case where that search can be collapsed into one calibration step. For strict linear preprocessing operators, the transformed PLS cross-covariance satisfies (X A^T)^T Y = A X^T Y, and Ridge regression depends on the operator-induced kernel X A^T A X^T. These identities allow a finite operator bank to be screened inside the model while retaining original-wavelength coefficients. Sample-adaptive or fitted corrections such as SNV, MSC, EMSC and ASLS remain fold-local branches, not absorbed into the algebra. The study uses the AOM benchmark cohort: 61 regression rows and 17 classification rows in the manifest. On the main regression denominator (N=32), plain compact-bank AOM-PLS records median RMSEP ratios of 0.991 against PLS-default and 0.990 against PLS-HPO; the selected ASLS-AOM-compact-cv5 branch records 0.985 and 1.002 on the same two references. The plain AOMRidge-global-compact-none baseline records 0.974 against Ridge-default and 0.984 against Ridge-HPO, while the selected AOMRidge-Blender-headline-spxy3 records 0.918 and 0.966. The selected classifier, AOM-PLS-DA-global-simpls-covariance, improves balanced accuracy by 0.159 on N=13 datasets with 12/13 wins. The runtime gap is the practical result: PLS-HPO takes a median total time of 710.81 s per run, whereas the selected AOM-PLS branch takes 1.63 s. Linear operator-adaptive calibration therefore gives comparable prediction quality to exhaustive preprocessing screening, with orders-of-magnitude less fitting time for PLS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper absorbs linear preprocessing into PLS and Ridge for NIR spectra via simple algebra, cutting search time sharply while matching accuracy on the reported benchmarks, though the PLS deflation equivalence needs explicit checking.


The main thing to know is that the authors collapse linear preprocessing search into the model fitting step itself for PLS and Ridge on NIR data. For operators that are strict linear transforms, the cross-covariance identity (X A^T)^T Y = A X^T Y and the induced Ridge kernel let them screen a bank of smoothers or derivatives internally in one pass while keeping coefficients on the original wavelengths. Sample-adaptive steps like SNV stay outside as fold-local branches.

Referee Report

1 major / 2 minor

Summary. The paper claims that strict linear preprocessing operators can be absorbed into PLS and Ridge regression via algebraic identities on the cross-covariance ((X A^T)^T Y = A X^T Y) and induced kernel (X A^T A X^T), allowing a finite bank of operators to be screened internally during a single calibration rather than via external exhaustive search. Sample-adaptive corrections remain as fold-local branches. On the main regression cohort (N=32), plain compact-bank AOM-PLS yields median RMSEP ratios of 0.991 vs. PLS-default and 0.990 vs. PLS-HPO; selected branches and AOM-Ridge variants are similarly close to 1.0 while reducing median runtime from ~711 s (PLS-HPO) to ~1.6 s. A selected PLS-DA variant improves balanced accuracy by 0.159 on N=13 classification datasets.

Significance. If the algebraic absorption extends exactly to the full iterative PLS procedure, the work offers a practical route to collapse preprocessing screening into model fitting, yielding comparable predictive performance at orders-of-magnitude lower cost. The large-scale benchmark on 32+13 datasets supplies concrete median ratios and runtime numbers; the parameter-free linear-algebra identities and retention of original-wavelength coefficients are additional strengths. The result is most impactful for high-throughput NIR workflows where external HPO is prohibitive.

major comments (1)

[Methods describing AOM-PLS and the operator absorption] Methods / PLS implementation: The identities hold for the initial cross-covariance, but standard NIPALS/SIMPLS PLS performs successive deflation of X and the current residual Y. It is not shown whether the operator A is reapplied at each deflation step or only to the initial weight vector; if the latter, the multi-component model is not guaranteed to match an external preprocessing-then-PLS pipeline. This equivalence is load-bearing for the central claim that internal screening replicates external screening.
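One fact bearing on this comment can be checked directly: NIPALS-style deflation multiplies X from the left (sample side), while A acts from the right (wavelength side), so for a single fixed operator the two commute. The sketch below assumes PLS1 with NIPALS deflation, scores computed from the transformed spectra, and a first-difference matrix as A. It does not reproduce the AOM-PLS implementation, and it says nothing about the case where a different operator is chosen for each component.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, n_comp = 30, 20, 3                 # toy sizes (assumed)
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
A = np.diff(np.eye(p), axis=0)           # hypothetical fixed linear operator

XA = X @ A.T                             # external route: deflate the transformed matrix
X_orig = X.copy()                        # internal route: deflate the original matrix
checks = []
for _ in range(n_comp):
    w = XA.T @ y                         # external weight direction for this component
    # Same vector obtained from the original-axis matrix plus one application of A.
    checks.append(np.allclose(w, A @ (X_orig.T @ y)))
    w /= np.linalg.norm(w)
    t = XA @ w                           # scores taken from the transformed spectra
    P = np.eye(n) - np.outer(t, t) / (t @ t)   # sample-side deflation projector
    XA = P @ XA                          # NIPALS deflation X <- X - t t^T X / (t^T t)
    X_orig = P @ X_orig                  # identical projector applied to original X
    # y is left undeflated: for PLS1 this does not change X^T y after X is deflated.

identity_survives_deflation = all(checks)
print(identity_survives_deflation)
```

Under these assumptions the identity holds at every component, which suggests the open question is how the scores and deflation are defined when operators are screened, rather than whether the algebra can extend past the first component.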

minor comments (2)

[Abstract and §5] Abstract and results: the manifest contains 61 regression rows yet the main denominator is N=32; state the exact selection rule for the reported cohort and whether it was pre-specified.
[Results tables] Notation: the compact-bank vs. global-bank distinction and the meaning of suffixes such as 'cv5', 'spxy3', and 'headline' are used in the reported ratios but not defined on first use.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript arXiv:2605.13587. The comment regarding the PLS implementation and the scope of the algebraic identities is well-taken. We have revised the Methods section to provide greater clarity on how the operator absorption is implemented in the context of iterative deflation procedures. Below we respond point-by-point to the major comment.


Referee: [Methods describing AOM-PLS and the operator absorption] Methods / PLS implementation: The identities hold for the initial cross-covariance, but standard NIPALS/SIMPLS PLS performs successive deflation of X and the current residual Y. It is not shown whether the operator A is reapplied at each deflation step or only to the initial weight vector; if the latter, the multi-component model is not guaranteed to match an external preprocessing-then-PLS pipeline. This equivalence is load-bearing for the central claim that internal screening replicates external screening.

Authors: We appreciate the referee's observation that the provided algebraic identities apply directly to the initial cross-covariance computation. In the AOM-PLS framework, the operator A is incorporated at the level of the initial covariance matrix to generate candidate weight vectors for the first latent variable. Subsequent deflation steps follow the standard NIPALS or SIMPLS procedure on the deflated matrices without re-applying the preprocessing operator A, as this would require re-transforming the residuals at each iteration, which is not part of the absorption identity. This implementation choice ensures that the final model coefficients remain in the original wavelength space and maintains the computational efficiency.

We acknowledge that for multi-component models, this does not guarantee exact numerical equivalence to an external 'preprocess then fit' pipeline for every component. However, our empirical results across the benchmark show that the predictive performance remains comparable, suggesting that the first-component approximation captures the essential benefits of preprocessing screening.

To address the concern, we have updated the manuscript with: (1) explicit description of the deflation procedure in the AOM-PLS algorithm, (2) a discussion of the conditions for exact equivalence (limited to single-component or covariance-only methods), and (3) supplementary experiments demonstrating the closeness of internal and external approaches even for multi-component PLS. These revisions clarify the scope of our claims without changing the overall conclusions or runtime advantages reported.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; identities are general linear algebra


The paper's derivation rests on the algebraic identities (X A^T)^T Y = A X^T Y and the induced kernel X A^T A X^T, which follow directly from matrix multiplication properties for any linear operator A and are not derived from or equivalent to the paper's empirical results by construction. Performance is assessed via median RMSEP ratios against independent external baselines (PLS-default, PLS-HPO, Ridge-default, Ridge-HPO) on the AOM cohort rather than any self-referential fit. Sample-adaptive corrections are explicitly excluded from absorption. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked; the skeptic concern about deflation affects exact equivalence (a correctness question) but does not reduce any claimed result to its own inputs. This is the common case of a self-contained empirical benchmark with independent algebraic support.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that preprocessing operators are strict linear maps whose action commutes with the PLS cross-covariance and Ridge kernel in the stated way; the AOM benchmark cohort is treated as representative without further justification in the abstract.

axioms (1)

standard math: For strict linear preprocessing operators the transformed PLS cross-covariance satisfies (X A^T)^T Y = A X^T Y and Ridge regression depends on the operator-induced kernel X A^T A X^T. Invoked in the abstract to justify collapsing the operator bank inside a single calibration step.


