Subsampling Bias and The Best-Discrepancy Systematic Cross Validation

Jianya Liu; Liang Guo; Ruodan Lu

arxiv: 1907.02437 · v1 · pith:27D24I4Wnew · submitted 2019-07-04 · stat.ML · cs.LG · stat.CO · stat.ME

Pith reviewed 2026-05-25 09:04 UTC · model grok-4.3

classification stat.ML · cs.LG · stat.CO · stat.ME

keywords cross-validation · Monte Carlo cross-validation · subsampling bias · low-discrepancy sequences · expected prediction error · systematic sampling · model validation · classification

The pith

Replacing the pseudo-random sequence in k-fold cross-validation with a best-discrepancy sequence reduces subsampling bias and produces more accurate expected prediction error estimates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a k-fold cross-validation procedure that substitutes a best-discrepancy sequence for the conventional pseudo-random sequence when partitioning data into subsets. This change rests on ordered systematic sampling theory and low-discrepancy sequence theory to limit the bias that random splits introduce into performance estimates. Experiments across 156 benchmark datasets using logistic regression, decision trees, and naive Bayes show the new procedure lowers expected prediction error by around 7.18 percent and variance by around 26.73 percent relative to standard Monte Carlo cross-validation. It also runs faster than standard MCCV, stratified MCCV, or leave-one-out methods and yields larger gains on smaller datasets with high aspect ratios. The authors indicate the systematic subsampling approach could extend to other machine learning routines that rely on random sampling.

Core claim

By grounding k-fold cross-validation in best-discrepancy sequences rather than pseudo-random ones, the procedure achieves lower expected prediction error estimates with reduced variance while requiring less computation time.

What carries the argument

best-discrepancy sequence for partitioning instances into k subsets, derived from ordered systematic sampling theory and low-discrepancy sequence theory to ensure low subsampling bias
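
The partitioning idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not say which sequence the paper uses, so the golden-ratio Kronecker sequence, the rank-then-deal assignment rule, and the names `kronecker_sequence` and `low_discrepancy_folds` are all hypothetical choices made here.

```python
import math

# Assumed irrational multiplier; the paper's actual choice is not given in this summary.
GOLDEN_FRACTION = (math.sqrt(5) - 1) / 2


def kronecker_sequence(n, alpha=GOLDEN_FRACTION):
    """Fractional parts {i * alpha} for i = 1..n, a classic low-discrepancy sequence."""
    return [(i * alpha) % 1.0 for i in range(1, n + 1)]


def low_discrepancy_folds(n_instances, k, alpha=GOLDEN_FRACTION):
    """Deterministically partition indices 0..n-1 into k folds.

    Instances are ranked by their sequence value, then dealt to folds in
    systematic (every k-th) order, so fold sizes differ by at most one.
    """
    points = kronecker_sequence(n_instances, alpha)
    order = sorted(range(n_instances), key=points.__getitem__)
    folds = [[] for _ in range(k)]
    for rank, index in enumerate(order):
        folds[rank % k].append(index)
    return folds


folds = low_discrepancy_folds(n_instances=103, k=5)
fold_sizes = [len(fold) for fold in folds]
```

Because the assignment has no random seed, repeated calls return the same folds. That determinism is one plausible source of the reduced variance the paper reports, though the summary does not establish this.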

Load-bearing premise

A best-discrepancy sequence will ensure lower subsampling bias than a pseudo-random sequence for arbitrary datasets and classifiers.

What would settle it

A replication on a new collection of benchmark datasets in which the reported reductions in expected prediction error and variance do not appear or fail to reach statistical significance.

Original abstract

Statistical machine learning models should be evaluated and validated before being put to work. The conventional k-fold Monte Carlo Cross-Validation (MCCV) procedure uses a pseudo-random sequence to partition instances into k subsets, which usually causes subsampling bias, inflates generalization errors and jeopardizes the reliability and effectiveness of cross-validation. Based on ordered systematic sampling theory in statistics and low-discrepancy sequence theory in number theory, we propose a new k-fold cross-validation procedure by replacing a pseudo-random sequence with a best-discrepancy sequence, which ensures low subsampling bias and leads to more precise Expected-Prediction-Error estimates. Experiments with 156 benchmark datasets and three classifiers (logistic regression, decision tree and naive Bayes) show that in general, our cross-validation procedure can extrude subsampling bias in the MCCV by lowering the EPE around 7.18% and the variances around 26.73%. In comparison, the stratified MCCV can reduce the EPE and variances of the MCCV around 1.58% and 11.85% respectively. The Leave-One-Out (LOO) can lower the EPE around 2.50% but its variances are much higher than those of any other CV procedure. The computational time of our cross-validation procedure is just 8.64% of the MCCV, 8.67% of the stratified MCCV and 16.72% of the LOO. Experiments also show that our approach is more beneficial for datasets characterized by relatively small size and large aspect ratio. This makes our approach particularly pertinent when solving bioscience classification problems. Our proposed systematic subsampling technique could be generalized to other machine learning algorithms that involve a random subsampling mechanism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

They swap pseudo-random splits for best-discrepancy sequences in MCCV and report ~7% lower EPE plus ~27% lower variance across 156 datasets, but the evidence that this actually cuts bias rather than shifting the estimator is indirect.

The core move is straightforward: replace the usual random index sequence in Monte Carlo cross-validation with a low-discrepancy sequence drawn from ordered systematic sampling and number-theoretic constructions. The authors run the comparison on 156 benchmark sets with logistic regression, decision trees, and naive Bayes, and they also benchmark against stratified MCCV and LOO. The reported gains are consistent in direction and the procedure runs faster than the baselines they test. That empirical footprint is the paper's main asset; a reader can see the scale of the test and the practical speed claim without needing to accept any deep theory first.

The soft spot is exactly the one the stress-test flags. EPE here is the cross-validation estimate itself, so a lower reported EPE does not automatically prove smaller absolute bias. Without a synthetic regime where the true expected prediction error is known in advance, it is possible the new procedure simply produces a different systematic offset. The mapping from continuous low-discrepancy properties to finite, arbitrarily distributed index sets is also not automatic, and the paper does not supply a ground-truth check or a formal bound that would close that gap.

The citation pattern looks clean; the work does not lean on self-referential claims. This is a methods refinement aimed at practitioners who already use MCCV on modest-sized or high-aspect-ratio data and want a drop-in replacement that is both faster and empirically stabler on the benchmarks shown. It is not resolving an open theoretical question, but the experimental design is large enough that a referee could usefully pressure-test the bias interpretation and ask for synthetic validation. I would send it to review rather than desk-reject.
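
To make the "continuous low-discrepancy property" concrete, the following sketch computes the one-dimensional star discrepancy of a point set and compares a pseudo-random sample with a golden-ratio Kronecker sequence. The closed-form expression for sorted points is the standard one-dimensional formula; the choice of sequence, the sample size, and the seed are assumptions made here, not details from the paper. The sketch says nothing about whether the property transfers to cross-validation bias, which is the gap the letter describes.

```python
import math
import random


def star_discrepancy_1d(points):
    """Star discrepancy of points in [0, 1).

    For sorted points x_(1) <= ... <= x_(n) the one-dimensional value is
    1/(2n) + max_i |x_(i) - (2i - 1)/(2n)|.
    """
    xs = sorted(points)
    n = len(xs)
    worst = max(abs(x - (2 * i - 1) / (2 * n)) for i, x in enumerate(xs, start=1))
    return 1 / (2 * n) + worst


n = 1000
golden = (math.sqrt(5) - 1) / 2  # assumed multiplier for illustration
kronecker = [(i * golden) % 1.0 for i in range(1, n + 1)]

rng = random.Random(0)  # fixed seed so the comparison is repeatable
pseudo_random = [rng.random() for _ in range(n)]

d_kronecker = star_discrepancy_1d(kronecker)
d_pseudo_random = star_discrepancy_1d(pseudo_random)
print(f"Kronecker: {d_kronecker:.5f}  pseudo-random: {d_pseudo_random:.5f}")
```

A smaller value means the points fill the unit interval more evenly. Whether even coverage of [0, 1) implies an even split of a finite dataset with arbitrary feature distributions is a separate question.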

Referee Report

3 major / 2 minor

Summary. The paper proposes replacing pseudo-random sequences with best-discrepancy sequences (grounded in ordered systematic sampling and low-discrepancy theory) in k-fold Monte Carlo Cross-Validation (MCCV) to reduce subsampling bias and yield more precise Expected Prediction Error (EPE) estimates. Experiments on 156 benchmark datasets using logistic regression, decision trees, and naive Bayes report that the new procedure lowers EPE by ~7.18% and variance by ~26.73% relative to standard MCCV, outperforming stratified MCCV (1.58% EPE, 11.85% variance reduction) and LOO (2.50% EPE reduction but higher variance), while requiring only 8.64% of MCCV compute time; benefits are noted especially for small, high-aspect-ratio datasets.

Significance. If the bias-reduction claim holds, the approach would offer a computationally efficient, lower-variance alternative to MCCV with direct relevance to bioscience classification tasks. The scale of the empirical evaluation (156 datasets, three classifiers) is a positive feature, as is the explicit comparison to stratified MCCV and LOO. However, the absence of an independent ground-truth EPE means the reported improvements cannot yet be interpreted as confirmed bias reduction rather than estimator shift.

major comments (3)

[Abstract / Experiments] Abstract and Experiments section: the central claim equates lower reported EPE with reduced subsampling bias, yet EPE is the CV estimate itself. Without an independent ground-truth EPE (obtainable via synthetic data with known model parameters or a large external hold-out set), a systematic downward shift in the estimator cannot be distinguished from smaller |bias|. This is load-bearing for the interpretation of the 7.18% and 26.73% figures.
[Theoretical background / Method] Theoretical motivation (low-discrepancy sequence construction): low-discrepancy guarantees equidistribution for quadrature over [0,1]^d, but the paper does not demonstrate that the induced discrete partition of finite, arbitrarily distributed feature vectors inherits the same bias-reduction property for the CV error functional. A concrete counter-example or proof sketch linking the continuous discrepancy bound to the discrete CV bias would be required.
[Experiments / Results] Table/figure reporting the 156-dataset results: aggregate percentages are given without per-dataset or per-classifier breakdowns, confidence intervals, or statistical tests for the claimed superiority. This weakens the “in general” conclusion and the claim that the method is “more beneficial for datasets characterized by relatively small size and large aspect ratio.”
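
The ground-truth check requested in the first major comment could look roughly like the sketch below. Everything in it is an illustrative assumption: the one-dimensional synthetic data generator, the nearest-class-mean classifier, the 5-fold shuffled split, and the use of a large fresh sample as a stand-in for the true expected prediction error. It only shows how a signed gap between a CV estimate and a reference error could be measured for any fold assignment.

```python
import random


def make_data(n, rng, noise=0.2):
    """Synthetic 1-D data: label is 1 when x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data


def fit(train):
    """Nearest-class-mean rule; returns a predict(x) function."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    if not xs0 or not xs1:  # degenerate case: predict the only class seen
        label = 1 if xs1 else 0
        return lambda x: label
    m0, m1 = sum(xs0) / len(xs0), sum(xs1) / len(xs1)
    cut = (m0 + m1) / 2
    upper = 1 if m1 >= m0 else 0
    return lambda x: upper if x > cut else 1 - upper


def error_rate(predict, data):
    return sum(predict(x) != y for x, y in data) / len(data)


def cv_estimate(data, folds):
    """Average held-out error over the given folds (lists of indices)."""
    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train = [d for i, d in enumerate(data) if i not in held_out]
        test = [data[i] for i in test_idx]
        errors.append(error_rate(fit(train), test))
    return sum(errors) / len(errors)


rng = random.Random(0)
sample = make_data(200, rng)
fresh = make_data(50_000, rng)  # large independent sample as the reference

reference_error = error_rate(fit(sample), fresh)

indices = list(range(len(sample)))
rng.shuffle(indices)  # conventional pseudo-random split
folds = [indices[j::5] for j in range(5)]

estimate = cv_estimate(sample, folds)
signed_gap = estimate - reference_error  # positive means the CV estimate is pessimistic
```

Swapping `folds` for a partition built from a different sequence and repeating over many synthetic samples would give the distribution of `signed_gap` per procedure, which is the quantity the referee asks the authors to compare.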

minor comments (2)

Notation: EPE is used both for the true expected prediction error and for its CV estimate; explicit distinction (e.g., EPE vs. ĒPE) would improve clarity.
Computational-time claim (8.64% of MCCV) lacks implementation details (language, hardware, whether the discrepancy sequence is pre-computed) that would allow reproduction.
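
One way to write the distinction asked for in the notation comment is sketched below. The loss $L$, the fitted model $\hat f$, and the subscript $\mathrm{CV}$ are notation assumed here for illustration; only the per-fold average over $F_k$ follows the passage quoted from the paper.

```latex
\mathrm{EPE} = \mathbb{E}_{(x,y)}\bigl[L\bigl(y, \hat f(x)\bigr)\bigr],
\qquad
\widehat{\mathrm{EPE}}_{\mathrm{CV}} = \frac{1}{K}\sum_{k=1}^{K}\widehat{\mathrm{EPE}}(F_k),
\qquad
\mathrm{bias} = \mathbb{E}\bigl[\widehat{\mathrm{EPE}}_{\mathrm{CV}}\bigr] - \mathrm{EPE}.
```

With this split, the reported 7.18% figure is a change in $\widehat{\mathrm{EPE}}_{\mathrm{CV}}$, while the subsampling-bias claim concerns the third quantity, which cannot be computed without knowing $\mathrm{EPE}$.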

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating where revisions to the manuscript are planned.

Referee: [Abstract / Experiments] Abstract and Experiments section: the central claim equates lower reported EPE with reduced subsampling bias, yet EPE is the CV estimate itself. Without an independent ground-truth EPE (obtainable via synthetic data with known model parameters or a large external hold-out set), a systematic downward shift in the estimator cannot be distinguished from smaller |bias|. This is load-bearing for the interpretation of the 7.18% and 26.73% figures.

Authors: We agree that the manuscript language should more carefully distinguish the CV estimator from the true EPE. The reported figures reflect reductions in the value and variance of the CV-estimated EPE. We will revise the abstract and experiments sections to emphasize improvements to the estimator while adding a note on the inferential nature of the bias-reduction interpretation in the absence of ground truth. revision: partial
Referee: [Theoretical background / Method] Theoretical motivation (low-discrepancy sequence construction): low-discrepancy guarantees equidistribution for quadrature over [0,1]^d, but the paper does not demonstrate that the induced discrete partition of finite, arbitrarily distributed feature vectors inherits the same bias-reduction property for the CV error functional. A concrete counter-example or proof sketch linking the continuous discrepancy bound to the discrete CV bias would be required.

Authors: The approach draws on ordered systematic sampling with best-discrepancy sequences to achieve more uniform data partitions. While empirical results support practical benefits, the manuscript does not contain a formal proof or counter-example connecting continuous discrepancy bounds to the discrete CV error functional for arbitrary distributions. revision: no
Referee: [Experiments / Results] Table/figure reporting the 156-dataset results: aggregate percentages are given without per-dataset or per-classifier breakdowns, confidence intervals, or statistical tests for the claimed superiority. This weakens the “in general” conclusion and the claim that the method is “more beneficial for datasets characterized by relatively small size and large aspect ratio.”

Authors: We accept the need for greater granularity. The revised manuscript will include supplementary per-dataset and per-classifier results, confidence intervals on the aggregate metrics, and statistical tests (e.g., paired comparisons) to support the reported superiority. Additional analysis correlating improvement magnitude with dataset size and aspect ratio will also be added. revision: yes
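
The paired comparisons and confidence intervals promised in the last response could be computed along these lines. This is a sketch under assumptions: the sign-flip permutation test and percentile bootstrap are choices made here (the rebuttal does not name a specific test), and the per-dataset error values are synthetic placeholders generated in the code, not results from the paper.

```python
import random


def paired_sign_flip_test(a, b, n_perm=10_000, seed=0):
    """Two-sided p-value for 'mean(a - b) == 0' using random sign flips of paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p > 0


def bootstrap_ci(diffs, n_boot=5_000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_boot))
    lower = means[int((1 - level) / 2 * n_boot)]
    upper = means[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


# Synthetic placeholder errors for 20 hypothetical datasets (NOT the paper's data).
rng = random.Random(1)
baseline = [rng.uniform(0.10, 0.40) for _ in range(20)]
candidate = [e - 0.05 for e in baseline]  # constant shift, purely for illustration

p_value = paired_sign_flip_test(baseline, candidate)
ci_low, ci_high = bootstrap_ci([x - y for x, y in zip(baseline, candidate)])
```

Pairing by dataset matters because error levels vary far more between datasets than between CV procedures; an unpaired comparison of the aggregate percentages would mostly measure that between-dataset spread.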

standing simulated objections not resolved

A formal proof or counter-example rigorously linking continuous low-discrepancy properties to bias reduction in the discrete CV error functional for arbitrary data distributions.

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external benchmarks

The paper's central claims rest on direct empirical comparisons of EPE and variance across 156 benchmark datasets against MCCV, stratified MCCV, and LOO. No derivation reduces by construction to fitted parameters, self-defined quantities, or self-citation chains; the method is defined via established low-discrepancy and systematic sampling theory, and results are measured against independent external benchmarks rather than internal fits.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the application of established sampling theories from statistics and number theory together with empirical comparisons on benchmark datasets. No free parameters are fitted inside the method itself, and no new entities are postulated.

axioms (1)

domain assumption: Best-discrepancy sequences drawn from low-discrepancy theory produce lower subsampling bias than pseudo-random sequences when used for k-fold partitioning.
This premise is invoked to justify replacing the random sequence in MCCV.

pith-pipeline@v0.9.0 · 5852 in / 1228 out tokens · 62394 ms · 2026-05-25T09:04:57.933518+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

Relation between the paper passage and the cited Recognition theorem.

We propose a k-fold CV procedure that is built upon the best-discrepancy systematic subsampling procedure ... $\mathrm{EPE}(\tilde{D})_{H,A} = \frac{1}{K}\sum_{k=1}^{K}\widehat{\mathrm{EPE}}(F_k)$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.