Nearly Optimal Subdata Selection

John Stufken; Ming-Chung Chang; Min Yang; Ting Tian; Wei Zheng; Xueqin Wang

arxiv: 2604.23930 · v1 · submitted 2026-04-27 · 📊 stat.ME · cs.LG· stat.ML

Nearly Optimal Subdata Selection

Min Yang , Wei Zheng , John Stufken , Ming-Chung Chang , Ting Tian , Xueqin Wang This is my paper

Pith reviewed 2026-05-08 02:33 UTC · model grok-4.3

classification 📊 stat.ME cs.LGstat.ML

keywords subdata selectionoptimal designinformation-based selectionparameter estimationapproximate design theoryefficiency boundsconvergenceNP-hard selection

0 comments

The pith

A new algorithm based on approximate design theory selects subdata that nearly maximizes information for parameter estimation in general models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a methodology that uses optimal approximate design theory to choose a subset of n points from N available data points so that the selected subdata retains nearly the maximum possible information for estimating parameters in a parametric statistical model. It introduces an algorithm that works for arbitrary N and n, supports different optimality criteria such as D-optimality, and comes with a proof that the algorithm converges to a solution whose efficiency approaches that of the theoretical optimum. The same framework also supplies tight lower and upper bounds on the efficiency of any chosen subdata, allowing direct comparison with other selection methods. A reader would care because full datasets are often too large or too expensive to label, yet discarding most points can destroy the precision of downstream estimates if the choice is poor.

Core claim

Based on optimal approximate design theory, the authors develop a new methodology for information-based subdata selection that produces subdata approaching the optimal solution; they supply a novel algorithm applicable to general parametric models, arbitrary N and n, and multiple optimality criteria, prove its convergence, and demonstrate that the resulting subdata is highly efficient and outperforms existing methods while also providing tight bounds for assessing any subdata's efficiency.

What carries the argument

The novel algorithm that converts information optimality from continuous approximate designs into a convergent discrete selection procedure for exact subdata of size n.

If this is right

The selected subdata approaches the information-optimal choice for parameter estimation under standard criteria.
The algorithm converges for any combination of total points N and desired subset size n.
The method applies directly to general parametric models rather than being restricted to linear regression.
Tight efficiency bounds become available to evaluate subdata chosen by any other procedure.
The same subdata outperforms all previously published selection methods in efficiency comparisons.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

In large-scale data pipelines the approach could cut labeling or processing costs while preserving most of the statistical precision that full data would deliver.
The convergence guarantee may encourage extensions to streaming or online settings where new points arrive sequentially.
Because bounds are supplied for any method, the work supplies a practical benchmark that future subdata algorithms can be measured against.

Load-bearing premise

The assumption that the information optimality defined for continuous approximate designs can be discretized into a selection of exact data points that retains nearly the same efficiency for general parametric models and arbitrary sizes.

What would settle it

For small N where exhaustive search is feasible, compute the true information-maximizing subdata of size n and check whether the algorithm's output reaches an efficiency within the paper's claimed bounds of that true optimum.

read the original abstract

When, in terms of the number of data points, the size of a dataset exceeds available computing resources, or when labeling is expensive, an attractive solution consists of selecting only some of the data points (subdata) for further consideration. A central question for selecting subdata of size $n$ from $N$ available data points is which $n$ points to select. While an answer to this question depends on the objective, one approach for a parametric model and a focus on parameter estimation is to select subdata that retains maximal information. Identifying such subdata is a classical NP-hard problem due to its inherent discreteness. Based on optimal approximate design theory, we develop a new methodology for information-based subdata selection, resulting in subdata that approaches the optimal solution. To achieve this, we develop a novel algorithm that applies to a general model, accommodates arbitrary choices of $N$ and $n$, and supports multiple optimality criteria, and we prove its convergence. Moreover, the new methodology facilitates an assessment of the efficiency of subdata selected by any method by obtaining tight lower and upper bounds for the efficiency. We show that the subdata obtained through the new methodology is highly efficient and outperforms all existing methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims to develop a new methodology for information-based subdata selection of size n from N points in parametric models, grounded in optimal approximate design theory. It introduces a novel algorithm applicable to general models, arbitrary N and n, and multiple optimality criteria; proves convergence of the algorithm; derives tight lower and upper bounds on efficiency to assess any subdata selection method; and asserts that the resulting subdata approaches the optimal solution, is highly efficient, and outperforms all existing methods.

Significance. If the convergence proof and efficiency bounds extend rigorously to the exact discrete case with a controlled optimality gap, the work would provide a theoretically grounded, practical tool for subdata selection that bridges approximate design theory to finite-sample problems. This could improve efficiency assessments and performance over heuristics in large-scale or costly-labeling settings, with the general-model applicability and bound-facilitated comparisons as notable strengths.

major comments (1)

[Abstract] Abstract and the sections developing the algorithm and convergence proof: the central claim that the methodology produces subdata that 'approaches the optimal solution' and 'outperforms all existing methods' is load-bearing on the translation from the continuous relaxation (fractional weights) to the exact discrete selection of n points. The convergence and bounds are established for the approximate design, but no general bound is given on the optimality gap for finite N and n in general (possibly nonlinear) parametric models, where the gap between approximate and exact information matrices can be large.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for identifying this key point about the relationship between the approximate design relaxation and the exact discrete subdata selection problem. We respond to the major comment below.

read point-by-point responses

Referee: [Abstract] Abstract and the sections developing the algorithm and convergence proof: the central claim that the methodology produces subdata that 'approaches the optimal solution' and 'outperforms all existing methods' is load-bearing on the translation from the continuous relaxation (fractional weights) to the exact discrete selection of n points. The convergence and bounds are established for the approximate design, but no general bound is given on the optimality gap for finite N and n in general (possibly nonlinear) parametric models, where the gap between approximate and exact information matrices can be large.

Authors: We agree that the convergence proof and the tight efficiency bounds are derived for the optimal approximate design (the continuous relaxation). The algorithm constructs an exact subdata selection of size n by using the optimal approximate weights, and the manuscript shows that this yields subdata whose information matrix is close to that of the approximate optimum. However, we do not provide a general, model-independent bound on the gap between the approximate and exact information matrices for arbitrary nonlinear models and finite n, as such a bound would necessarily be model-specific and can indeed be large in some pathological cases. We will revise the abstract and the relevant sections to clarify that 'nearly optimal' and 'approaches the optimal solution' refer to proximity to the optimal approximate design, with the exact subdata inheriting high efficiency under the conditions where the rounding step preserves most of the information. We will also add a brief discussion of the approximate-to-exact gap, including references to known results on the size of this gap and numerical illustrations from our simulations showing that the gap remains small for the models considered. revision: partial

Circularity Check

0 steps flagged

No circularity; derivation builds on external optimal design theory with independent algorithmic contribution

full rationale

The paper grounds its methodology in established optimal approximate design theory (an external body of results) and introduces a novel algorithm for subdata selection that applies to general parametric models, accommodates arbitrary N and n, supports multiple criteria, and includes a convergence proof. Efficiency assessment via tight lower/upper bounds and the claim of high efficiency relative to existing methods follow directly from this construction without any step reducing by definition to its own inputs, without fitted parameters renamed as predictions, and without load-bearing self-citations or uniqueness theorems imported from the authors' prior work. The central claims remain independent of the paper's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the transferability of approximate design optimality to exact discrete selection and on the existence of a convergent algorithm for general models; no free parameters or invented entities are evident from the abstract.

axioms (1)

domain assumption Optimal approximate design theory provides a usable relaxation for the NP-hard exact subdata selection problem in parametric models.
The entire methodology is built on converting the discrete problem into an approximate continuous one.

pith-pipeline@v0.9.0 · 5523 in / 1279 out tokens · 59298 ms · 2026-05-08T02:33:09.120968+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1]

Ahipaşaoğlu, S. D. (2021), ‘A branch-and-bound algorithm for the exact optimal experi- mental design problem’,Statistics and Computing31(5),

work page 2021
[2]

M., Mahmood, A

Bagirov, A. M., Mahmood, A. & Barton, A. (2017), ‘Prediction of monthly rainfall in Victoria, Australia: Clusterwise linear regression approach’,Atmospheric Research188. 15 Brusco, M. J., Cradit, J. D. & Tashchian, A. (2003), ‘Multicriterion Clusterwise Regression for Joint Segmentation Settings: An Application to Customer Value’,Journal of Marketing Rese...

work page 2017
[3]

& Yang, J

Huang, Y., Tong, L. & Yang, J. (2023), ‘Constrained d-optimal design for paid research study’,Statistica Sinicato appear. Isaacson, E. & Keller, H. B. (1966),Analysis of numerical methods, John Wiley & Sons Inc. Joseph, V. R. & Vakayil, A. (2022), ‘Split: An optimal method for data splitting’,Techno- metrics64, 166–176. Khadka, M. & Paz, A. (2017), ‘Compr...

work page 2023
[4]

& Stufken, J

Wang, H., Yang, M. & Stufken, J. (2019), ‘Information-Based Optimal Subdata Selection for BigDataLinearRegression’,Journal of the American Statistical Association114(525),393–

work page 2019
[5]

Wang, L., Elmstedt, J., Wong, W. K. & Xu, H. (2021), ‘Orthogonal subsampling for big data linear regression’,The Annals of Applied Statistics15(3), 1273–1290. Wang, Y., Yu, A. W. & Singh, A. (2017), ‘On computationally tractable selection of experiments in measurement-constrained regression models’,Journal of Machine Learning Research18, 1–41. Welch, W. (...

work page 2021
[6]

The information matrix ofξP can be viewed as a function ofwww′=(w1,...,wn2−1), wherewi denotes the weight forxi∈X2, i = 1,...,n 2−1

Letn2≥2, which is the only case for which a proof is needed. The information matrix ofξP can be viewed as a function ofwww′=(w1,...,wn2−1), wherewi denotes the weight forxi∈X2, i = 1,...,n 2−1. The weight ofxn2∈X2 is1 −n3/n−∑n2−1 i=1 wi. By Theorem 4 of Yang et al. (2013), the optimal weights are critical points, that is, solutions of∂Φ(ξP) ∂wi = 0, or bo...

work page 2013
[7]

We recommend that an adaptation of the subdata selection method in Wang et al

B.1 Selecting initial subdata Starting Algorithm 1 in line 4 with efficient subdata will generally result in a shorter running time to reach convergence. We recommend that an adaptation of the subdata selection method in Wang et al. (2019) and Cheng et al. (2020) be used to select the initial subdata. The models considered by these authors use a predictor...

work page 2019
[8]

The computational complexity of this algorithm isO(Np 2). Although the subdata generated by this fast algorithm may not be highly efficient for all models, it provides a starting point for Algorithm 1 that tends to be far better than using randomly selected subdata. B.2 Deriving optimal weights Common algorithms for deriving optimal weights, such as the m...

work page 2011
[9]

Thus, optimal weights can be obtained by solving nonlinear equations

With constraints0≤wi≤1/n and ∑N i=1wi = 1, the same result holds, except that boundary points are now those points withwi = 0or wi = 1/n for at least onei. Thus, optimal weights can be obtained by solving nonlinear equations. Normally, there is no closed-form solution, so we use a numerical approach based on Newton’s method. In general, this has the desir...

work page 1966
[10]

because the size ofX2 changes when a weight becomes equal to 0 or1/n. Algorithm 2Optimal Weights Algorithm 1: Input: The input for Algorithm 1; a bounded approximate designξP with0 ≤wi≤1/n and the corresponding partitionP = (X1,X 2,X 3)with X2̸=∅; a positive scalarϵ1; the maximum number of iterationsMI1. 2: Output: Updated partitionP = (X1,X 2,X 3)and opt...

work page 2025
[11]

Algorithm 1 typically converges before it reachesMI 2, but even when it doesn’t, little change occurs after increasing the number of iterations

As a practical recommendation, we suggest thatMI 2 be set ton. Algorithm 1 typically converges before it reachesMI 2, but even when it doesn’t, little change occurs after increasing the number of iterations. 31 Table 3: Mean and standard deviation (in parentheses) ofΦ(ξIBOSS OBD )/Φ(ξFO)under the A-optimality criterion MI3 n10n20n50n Φ(ξIBOSS OBD)/Φ(ξFO)9...

work page 2021
[12]

Table 6: Scenario (ii):A-optimality criterion for all parameters (N = 100000and n =

Algorithm SRS LEV SEQ FO IBOSS+ IBOSS++ IBOSS OBD Mean Eff (%) 9.16 10.04 71.33 73.18 93.05 99.82 100.00 Std Eff (%) 0.67 0.60 0.76 0.56 1.63 0.26 0.00 Mean Time (s) 0.0002 0.0582 0.2009 2.3319 0.1341 0.2837 1.0313 Std Time (s) 0.0000 0.0198 0.0348 0.2831 0.0176 0.0385 0.2577 Mean(size) = 938.54, Std(size) = 41.76; convergence rates of IBOSS OBD is 0.99. ...

work page 2009
[13]

SRS and LEV remain the fastest methods but have low efficiencies (less than 20%)

Algorithm SRS LEV FO IBOSS+ IBOSS++ IBOSS OBD Mean Eff (%) 16.60 17.11 99.99 96.08 99.30 100.00 Std Eff (%) 0.88 0.51 0.00 1.19 0.02 0.00 Mean Time (s) 0.0002 0.0606 43.9571 0.1563 0.3221 1.5092 Std Time (s) 0.0002 0.0100 1.6908 0.0135 0.0264 0.3608 Convergence rates: FO = 0.00 and IBOSS OBD = 1.00. SRS and LEV remain the fastest methods but have low effi...

work page 2003
[14]

The SEQ approach starts by selecting the first5×110 = 550data points

For this relatively more complex setting, we exclude the LEV and FO methods because they depend on a specific matrix structure that is not directly applicable here. The SEQ approach starts by selecting the first5×110 = 550data points. With a target subdata size of 1000, it would only allow judicious selection of 450 data points. Therefore, to give SEQ a b...

work page 2002

[1] [1]

Ahipaşaoğlu, S. D. (2021), ‘A branch-and-bound algorithm for the exact optimal experi- mental design problem’,Statistics and Computing31(5),

work page 2021

[2] [2]

M., Mahmood, A

Bagirov, A. M., Mahmood, A. & Barton, A. (2017), ‘Prediction of monthly rainfall in Victoria, Australia: Clusterwise linear regression approach’,Atmospheric Research188. 15 Brusco, M. J., Cradit, J. D. & Tashchian, A. (2003), ‘Multicriterion Clusterwise Regression for Joint Segmentation Settings: An Application to Customer Value’,Journal of Marketing Rese...

work page 2017

[3] [3]

& Yang, J

Huang, Y., Tong, L. & Yang, J. (2023), ‘Constrained d-optimal design for paid research study’,Statistica Sinicato appear. Isaacson, E. & Keller, H. B. (1966),Analysis of numerical methods, John Wiley & Sons Inc. Joseph, V. R. & Vakayil, A. (2022), ‘Split: An optimal method for data splitting’,Techno- metrics64, 166–176. Khadka, M. & Paz, A. (2017), ‘Compr...

work page 2023

[4] [4]

& Stufken, J

Wang, H., Yang, M. & Stufken, J. (2019), ‘Information-Based Optimal Subdata Selection for BigDataLinearRegression’,Journal of the American Statistical Association114(525),393–

work page 2019

[5] [5]

Wang, L., Elmstedt, J., Wong, W. K. & Xu, H. (2021), ‘Orthogonal subsampling for big data linear regression’,The Annals of Applied Statistics15(3), 1273–1290. Wang, Y., Yu, A. W. & Singh, A. (2017), ‘On computationally tractable selection of experiments in measurement-constrained regression models’,Journal of Machine Learning Research18, 1–41. Welch, W. (...

work page 2021

[6] [6]

The information matrix ofξP can be viewed as a function ofwww′=(w1,...,wn2−1), wherewi denotes the weight forxi∈X2, i = 1,...,n 2−1

Letn2≥2, which is the only case for which a proof is needed. The information matrix ofξP can be viewed as a function ofwww′=(w1,...,wn2−1), wherewi denotes the weight forxi∈X2, i = 1,...,n 2−1. The weight ofxn2∈X2 is1 −n3/n−∑n2−1 i=1 wi. By Theorem 4 of Yang et al. (2013), the optimal weights are critical points, that is, solutions of∂Φ(ξP) ∂wi = 0, or bo...

work page 2013

[7] [7]

We recommend that an adaptation of the subdata selection method in Wang et al

B.1 Selecting initial subdata Starting Algorithm 1 in line 4 with efficient subdata will generally result in a shorter running time to reach convergence. We recommend that an adaptation of the subdata selection method in Wang et al. (2019) and Cheng et al. (2020) be used to select the initial subdata. The models considered by these authors use a predictor...

work page 2019

[8] [8]

The computational complexity of this algorithm isO(Np 2). Although the subdata generated by this fast algorithm may not be highly efficient for all models, it provides a starting point for Algorithm 1 that tends to be far better than using randomly selected subdata. B.2 Deriving optimal weights Common algorithms for deriving optimal weights, such as the m...

work page 2011

[9] [9]

Thus, optimal weights can be obtained by solving nonlinear equations

With constraints0≤wi≤1/n and ∑N i=1wi = 1, the same result holds, except that boundary points are now those points withwi = 0or wi = 1/n for at least onei. Thus, optimal weights can be obtained by solving nonlinear equations. Normally, there is no closed-form solution, so we use a numerical approach based on Newton’s method. In general, this has the desir...

work page 1966

[10] [10]

because the size ofX2 changes when a weight becomes equal to 0 or1/n. Algorithm 2Optimal Weights Algorithm 1: Input: The input for Algorithm 1; a bounded approximate designξP with0 ≤wi≤1/n and the corresponding partitionP = (X1,X 2,X 3)with X2̸=∅; a positive scalarϵ1; the maximum number of iterationsMI1. 2: Output: Updated partitionP = (X1,X 2,X 3)and opt...

work page 2025

[11] [11]

Algorithm 1 typically converges before it reachesMI 2, but even when it doesn’t, little change occurs after increasing the number of iterations

As a practical recommendation, we suggest thatMI 2 be set ton. Algorithm 1 typically converges before it reachesMI 2, but even when it doesn’t, little change occurs after increasing the number of iterations. 31 Table 3: Mean and standard deviation (in parentheses) ofΦ(ξIBOSS OBD )/Φ(ξFO)under the A-optimality criterion MI3 n10n20n50n Φ(ξIBOSS OBD)/Φ(ξFO)9...

work page 2021

[12] [12]

Table 6: Scenario (ii):A-optimality criterion for all parameters (N = 100000and n =

Algorithm SRS LEV SEQ FO IBOSS+ IBOSS++ IBOSS OBD Mean Eff (%) 9.16 10.04 71.33 73.18 93.05 99.82 100.00 Std Eff (%) 0.67 0.60 0.76 0.56 1.63 0.26 0.00 Mean Time (s) 0.0002 0.0582 0.2009 2.3319 0.1341 0.2837 1.0313 Std Time (s) 0.0000 0.0198 0.0348 0.2831 0.0176 0.0385 0.2577 Mean(size) = 938.54, Std(size) = 41.76; convergence rates of IBOSS OBD is 0.99. ...

work page 2009

[13] [13]

SRS and LEV remain the fastest methods but have low efficiencies (less than 20%)

Algorithm SRS LEV FO IBOSS+ IBOSS++ IBOSS OBD Mean Eff (%) 16.60 17.11 99.99 96.08 99.30 100.00 Std Eff (%) 0.88 0.51 0.00 1.19 0.02 0.00 Mean Time (s) 0.0002 0.0606 43.9571 0.1563 0.3221 1.5092 Std Time (s) 0.0002 0.0100 1.6908 0.0135 0.0264 0.3608 Convergence rates: FO = 0.00 and IBOSS OBD = 1.00. SRS and LEV remain the fastest methods but have low effi...

work page 2003

[14] [14]

The SEQ approach starts by selecting the first5×110 = 550data points

For this relatively more complex setting, we exclude the LEV and FO methods because they depend on a specific matrix structure that is not directly applicable here. The SEQ approach starts by selecting the first5×110 = 550data points. With a target subdata size of 1000, it would only allow judicious selection of 450 data points. Therefore, to give SEQ a b...

work page 2002