A Moment-assisted Approach for Improving Subsampling-based MLE with Large-scale data

Miaomiao Su; Qihua Wang; Ruoyu Wang

arxiv: 2309.09872 · v4 · submitted 2023-09-18 · 📊 stat.ME

Pith reviewed 2026-05-24 06:55 UTC · model grok-4.3

classification 📊 stat.ME

keywords moment-assisted subsampling · maximum likelihood estimation · generalized method of moments · large-scale data · asymptotic efficiency · subsampling methods · statistical estimation

The pith

Subsampling-based maximum likelihood estimation can match full-data efficiency by incorporating optimal sample moments from the entire dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a moment-assisted subsampling (MAS) estimator that augments standard subsampling MLE with generalized method of moments constraints drawn from sample moments of the full dataset. These moments are cheap to compute even when the full sample is enormous, allowing the method to reduce the efficiency loss that normally accompanies subsampling. The resulting estimator is asymptotically normal, and its asymptotic variance is smaller than that of the plain subsampled MLE. The authors derive the optimal moment that minimizes this variance in the Loewner order and show that, when this moment is used, the MAS estimator attains the same asymptotic efficiency as the maximum likelihood estimator computed on the complete data. The approach is aimed at models whose likelihoods involve integrals, where full-data MLE is computationally demanding.

Core claim

The MAS estimator, formed by combining a subsampled likelihood with GMM constraints from full-data moments, is asymptotically normal. Its asymptotic variance is smaller than that of the subsampled MLE alone and equals that of the full-data MLE when the optimal moment is incorporated.

What carries the argument

The moment-assisted subsampling (MAS) estimator constructed via generalized method of moments that augments the subsampled likelihood score with whole-data sample moments.
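As a rough illustration of how whole-data moments can sharpen a subsample estimate, the sketch below applies a one-step control-variate correction instead of solving the paper's GMM problem. That substitution, the function name `mas_one_step`, the influence-function input `psi_sub`, and the exponential toy model are all assumptions made here for illustration; none of them is taken from the paper.

```python
import numpy as np

def mas_one_step(theta_sub, psi_sub, h_sub, h_full_mean):
    """One-step moment-assisted correction (hypothetical helper, control-variate form).

    theta_sub   : subsample estimate, shape (p,) or scalar
    psi_sub     : estimated influence-function values on the subsample, shape (n, p) or (n,)
    h_sub       : moment-function values on the subsample, shape (n, q) or (n,)
    h_full_mean : sample mean of the moment function over the whole data, shape (q,) or scalar
    """
    psi_sub = np.asarray(psi_sub, dtype=float).reshape(len(psi_sub), -1)
    h_sub = np.asarray(h_sub, dtype=float).reshape(len(h_sub), -1)
    n = psi_sub.shape[0]
    psi_c = psi_sub - psi_sub.mean(axis=0)
    h_c = h_sub - h_sub.mean(axis=0)
    C = psi_c.T @ h_c / n   # sample covariance between influence function and moment
    V = h_c.T @ h_c / n     # sample variance of the moment
    gap = h_sub.mean(axis=0) - np.asarray(h_full_mean, dtype=float).reshape(-1)
    theta_sub = np.asarray(theta_sub, dtype=float).reshape(-1)
    return theta_sub - C @ np.linalg.solve(V, gap)

# Toy model (assumption): X ~ Exponential(rate theta), moment function h(x) = x.
rng = np.random.default_rng(0)
theta_true, N, n = 2.0, 200_000, 2_000
x = rng.exponential(scale=1.0 / theta_true, size=N)
x_sub = x[rng.choice(N, size=n, replace=False)]   # uniform subsample

theta_sub = 1.0 / x_sub.mean()                        # subsample MLE of the rate
psi_sub = -theta_sub**2 * (x_sub - 1.0 / theta_sub)   # plug-in influence function of that MLE
theta_mas = mas_one_step(theta_sub, psi_sub, x_sub, x.mean())[0]
theta_full = 1.0 / x.mean()                           # full-data MLE, for comparison only

print(theta_sub, theta_mas, theta_full)
```

In this toy case the moment `h(x) = x` is proportional to the score, so the corrected estimate lands very close to the full-data MLE even though the likelihood is only evaluated on the subsample. With a less informative moment the correction would recover only part of the gap.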

If this is right

The asymptotic variance of the MAS estimator is smaller than that of the corresponding subsampled MLE without moment augmentation.
When the optimal moment is used, the MAS estimator achieves the same asymptotic efficiency as the full-data maximum likelihood estimator.
The method remains computationally fast because the additional moments can be calculated in linear time even for massive datasets.
The efficiency gain holds for likelihoods that contain intractable integrals where direct full-data MLE is prohibitive.
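The linear-time point can be made concrete with a single pass over the data in chunks, which is one plausible way to obtain whole-data sample moments when the data do not fit in memory. The helper name `streaming_moment_mean` and the chunked layout are assumptions of this sketch, not something the paper specifies.

```python
import numpy as np

def streaming_moment_mean(chunks, h):
    """Mean of h(z) over all rows seen in `chunks`, using one pass and O(q) memory."""
    total = None
    count = 0
    for chunk in chunks:
        if len(chunk) == 0:
            continue                      # skip empty chunks
        values = np.asarray(h(chunk), dtype=float)
        values = values.reshape(values.shape[0], -1)   # (rows in chunk, q)
        chunk_sum = values.sum(axis=0)
        total = chunk_sum if total is None else total + chunk_sum
        count += values.shape[0]
    if count == 0:
        raise ValueError("no data in chunks")
    return total / count

# Usage: first and second sample moments of 0, 1, ..., 9 read in three chunks.
data = np.arange(10.0)
chunks = np.array_split(data, 3)
moments = streaming_moment_mean(chunks, lambda c: np.column_stack([c, c**2]))
print(moments)
```

Each observation is touched once and only running sums are stored, so the cost grows linearly in the full sample size and the chunks could equally come from files or separate machines.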

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same moment-augmentation idea could be applied to other computationally intensive estimators such as M-estimators or quasi-likelihood methods.
In settings where multiple candidate moments exist, a data-driven selection rule would be needed to approximate the optimal moment without knowing the true parameter.
The approach suggests that pre-computing a small set of low-cost statistics can serve as a general bridge between subsampling and full-data analysis in distributed or streaming environments.

Load-bearing premise

The chosen sample moments are correctly specified under the model and do not introduce bias when combined with the subsampled likelihood.

What would settle it

A simulation with known true parameters in which the empirical variance of the MAS estimator, using the derived optimal moment, remains larger than that of the full-data MLE as the subsample and full-sample sizes grow would falsify the efficiency claim.
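A check of this kind might look like the Monte Carlo sketch below. It is a toy setup chosen here, not the paper's simulation design: the data are exponential with a known rate, the moment is `h(x) = x`, and the moment-assisted estimate uses a one-step control-variate form, which for this model reduces to the closed form noted in the code comment. The function name `simulate_mse` and all default sizes are assumptions.

```python
import numpy as np

def simulate_mse(theta=2.0, N=20_000, n=500, reps=300, seed=1):
    """Monte Carlo MSE of three estimators of an exponential rate (toy model)."""
    rng = np.random.default_rng(seed)
    sq_err = {"subsample": 0.0, "moment_assisted": 0.0, "full": 0.0}
    for _ in range(reps):
        x = rng.exponential(scale=1.0 / theta, size=N)
        x_sub = x[rng.choice(N, size=n, replace=False)]   # uniform subsample
        t_sub = 1.0 / x_sub.mean()                        # subsample MLE
        # One-step correction with h(x) = x: for this model the coefficient
        # Cov(influence, h) / Var(h) equals -t_sub**2, giving the line below.
        t_mas = t_sub + t_sub**2 * (x_sub.mean() - x.mean())
        t_full = 1.0 / x.mean()                           # full-data MLE
        for name, est in (("subsample", t_sub),
                          ("moment_assisted", t_mas),
                          ("full", t_full)):
            sq_err[name] += (est - theta) ** 2
    return {name: total / reps for name, total in sq_err.items()}

mse = simulate_mse()
for name, value in mse.items():
    print(f"{name:16s} {value:.6f}")
```

If the efficiency claim holds in this toy case, the moment-assisted MSE should sit near the full-data MSE and well below the plain subsample MSE; a persistent gap between the moment-assisted and full-data rows as `n` and `N` grow would be the kind of evidence described above.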

Figures

Figures reproduced from arXiv: 2309.09872 by Miaomiao Su, Qihua Wang, Ruoyu Wang.

**Figure 2.** The logarithm of MSE ratio of the plain estimator to…

**Figure 3.** NB and NSE of different estimators under the mixed effec…

**Figure 4.** The logarithm of MSE ratio of the plain estimator to…

**Figure 5.** The logarithm of RE of the uniform subsampling-bas…

**Figure 6.** The logarithm of MSE ratio of the plain estimator to…

**Figure 7.** NB and NSE of different estimators under the logistic…

**Figure 8.** NB and NSE of different estimators under the mixed effec…

**Figure 9.** Norm of bias vector (NB) and standard error vector (…

**Figure 10.** The logarithm of MSE ratio of the plain estimator t…

Original abstract

The maximum likelihood estimation is computationally demanding for large datasets, particularly when the likelihood function includes integrals. Subsampling can reduce the computational burden, but it often results in efficiency loss. This paper proposes a moment-assisted subsampling (MAS) method that can improve the estimation efficiency of existing subsampling-based maximum likelihood estimators. The motivation behind this approach stems from the fact that sample moments can be efficiently computed even if the sample size of the whole data set is huge. Through the generalized method of moments, the proposed method incorporates informative sample moments of the whole data. The MAS estimator can be computed rapidly and is asymptotically normal with a smaller asymptotic variance than the corresponding estimator without incorporating sample moments of the whole data. The asymptotic variance of the proposed estimator depends on the specific sample moments incorporated. We derive the optimal moment that minimizes the resulting asymptotic variance in terms of Loewner order. The proposed MAS estimator can achieve the same estimation efficiency as the whole data-based estimator when the optimal moment is incorporated. Numerical results demonstrate the promising performance of the proposed method in both estimation and computational efficiency compared with existing subsampling methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

MAS combines subsampled MLE with full-data moments via GMM for variance reduction, but the dependence between subsample score and full moments needs explicit handling in the joint asymptotics.

The paper's main move is to take an existing subsampled MLE and augment it with cheap full-sample moments inside a GMM framework, then pick the moment that minimizes the asymptotic variance in Loewner order. When the optimal moment is used, the claim is that you recover the efficiency of the full-data MLE. The numerical examples show the expected gains in mean squared error with little added runtime, which is the practical payoff for large-scale likelihood problems where the likelihood itself is expensive to evaluate on every observation. That construction and the explicit optimal-moment formula are what is new relative to plain subsampling MLE work. The rest follows standard GMM and MLE expansions.

The soft spot is the dependence issue. The subsample is drawn from the same finite population as the full moments, so the score and the moment conditions are correlated; the cross-covariance term is of order n/N and does not vanish. If the variance formula and the optimality result treat the two pieces as independent or omit that term, the stated reduction and the full-efficiency recovery do not follow exactly. The abstract gives no sign that this cross term is derived or bounded, so the central efficiency claim rests on an unverified step.

This is worth a referee because the method is simple to implement and targets a real computational bottleneck; a careful check of the joint CLT would either confirm the result or show the size of the correction needed. Readers already working on scalable MLE or subsampling will find it directly usable. I would send it to review rather than desk-reject.
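The size of the cross term can be worked out in a simplified setting. The derivation below is a sketch under assumptions chosen here, not the paper's proof: the observations are i.i.d., the subsample is uniform without replacement and drawn independently of the data, and the moment-assisted estimator is approximated by a one-step control-variate form. The symbols \psi, h, C, V and \Sigma_\psi are notation introduced for this note.

```latex
% Assumptions (this note's, not the paper's): Z_1,...,Z_N i.i.d.; S is a uniform
% subsample of size n drawn without replacement, independently of the data;
% \psi is the influence function of the subsampled MLE; h is the moment function.
\[
\bar\psi_n = \frac{1}{n}\sum_{i\in S}\psi(Z_i),\qquad
\bar h_n = \frac{1}{n}\sum_{i\in S}h(Z_i),\qquad
\bar h_N = \frac{1}{N}\sum_{j=1}^{N}h(Z_j),
\]
\[
\Sigma_\psi = \operatorname{Var}\{\psi(Z)\},\qquad
C = \operatorname{Cov}\{\psi(Z),h(Z)\},\qquad
V = \operatorname{Var}\{h(Z)\}.
\]
\begin{align*}
\operatorname{Cov}(\bar\psi_n,\bar h_N)
  &= \frac{1}{nN}\sum_{i\in S}\sum_{j=1}^{N}\operatorname{Cov}\{\psi(Z_i),h(Z_j)\}
   = \frac{C}{N},\\
\operatorname{Cov}(\bar\psi_n,\bar h_n-\bar h_N)
  &= \Big(\frac{1}{n}-\frac{1}{N}\Big)C,\\
\operatorname{Var}(\bar h_n-\bar h_N)
  &= \Big(\frac{1}{n}-\frac{1}{N}\Big)V,\\
\operatorname{Var}\big\{\bar\psi_n - CV^{-1}(\bar h_n-\bar h_N)\big\}
  &= \frac{\Sigma_\psi}{n}-\Big(\frac{1}{n}-\frac{1}{N}\Big)CV^{-1}C^{\top}.
\end{align*}
% With h = \psi, so that C = V = \Sigma_\psi, the last line equals \Sigma_\psi / N.
```

Under these assumptions the cross-covariance between the subsample score term and the full-data moment is C/N, that is, of order n/N relative to the subsample variance \Sigma_\psi/n. It is exactly this term that lets the corrected variance fall to the full-data level \Sigma_\psi/N when h is taken proportional to the score, so dropping it would change the conclusion rather than merely loosen it.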

Referee Report

2 major / 2 minor

Summary. The paper proposes a moment-assisted subsampling (MAS) estimator that augments a subsampled MLE with full-data sample moments via GMM. It claims the resulting estimator is asymptotically normal, has smaller asymptotic variance than the plain subsampled MLE, and recovers the full-data MLE efficiency when the optimal moment (in the Loewner sense) is used.

Significance. If the joint asymptotics are correctly derived, the method supplies a computationally cheap route to near-full efficiency for expensive MLE problems by exploiting moments that can be computed on the entire data set; this would be a useful addition to the subsampling literature.

major comments (2)

[Asymptotic theory section] Derivation of the joint CLT and variance formula: the claimed asymptotic variance reduction and the Loewner-order optimality of the optimal moment both rest on the joint limiting distribution of the subsampled score (variance of order 1/n) and the full-data moments (variance of order 1/N). Because the subsample is drawn from the same finite population, the cross-covariance term is nonzero and of order n/N relative to the subsample variance. The manuscript must explicitly state whether this term is retained in the sandwich formula; if it is omitted or treated as asymptotically negligible, the stated variance comparison and the recovery of full-data efficiency do not follow.
[Theorem on optimal moment] The result that the MAS estimator attains the full-data efficiency bound: the proof must verify that the chosen moment remains correctly specified and that the GMM augmentation does not introduce bias under the model. The regularity conditions under which the post-selection moment remains informative after subsampling should be stated explicitly.

minor comments (2)

[Abstract] The abstract states that the MAS estimator 'can achieve the same estimation efficiency as the whole data-based estimator when the optimal moment is incorporated,' but the precise sense in which equality holds (e.g., asymptotic equivalence or equality of asymptotic variances) should be clarified in the main text.
Notation for the subsample size n versus full size N is used throughout; a single consolidated table or paragraph listing all symbols and their orders would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on the asymptotic theory. We address each major comment below and will revise the manuscript accordingly to improve clarity.

Referee: [Asymptotic theory section] Derivation of the joint CLT and variance formula: the claimed asymptotic variance reduction and the Loewner-order optimality of the optimal moment both rest on the joint limiting distribution of the subsampled score (variance of order 1/n) and the full-data moments (variance of order 1/N). Because the subsample is drawn from the same finite population, the cross-covariance term is nonzero and of order n/N relative to the subsample variance. The manuscript must explicitly state whether this term is retained in the sandwich formula; if it is omitted or treated as asymptotically negligible, the stated variance comparison and the recovery of full-data efficiency do not follow.

Authors: We thank the referee for highlighting this point. The joint CLT derivation in the manuscript retains the cross-covariance term of order n/N in the sandwich variance formula; this term is not omitted or treated as negligible. With the term included, the claimed variance reduction relative to subsampled MLE and the attainment of full-data efficiency under the optimal moment both hold. We will add an explicit statement in the revised asymptotic theory section confirming that the cross-covariance is retained.

revision: yes

Referee: [Theorem on optimal moment] The result that the MAS estimator attains the full-data efficiency bound: the proof must verify that the chosen moment remains correctly specified and that the GMM augmentation does not introduce bias under the model. The regularity conditions under which the post-selection moment remains informative after subsampling should be stated explicitly.

Authors: The full-data sample moments are correctly specified under the model, so the GMM step introduces no bias. We will revise the proof of the optimal-moment theorem to explicitly verify correct specification and lack of bias, and we will add the required regularity conditions (including those ensuring the moments remain informative after subsampling) in the revised manuscript.

revision: yes

Circularity Check

0 steps flagged

Asymptotics derived from standard GMM theory; no reduction to self-defined inputs

The MAS estimator is formed by augmenting the subsampled score with full-data moments inside a GMM framework. The claimed asymptotic normality, variance reduction, and Loewner-optimal moment all follow from the standard GMM sandwich formula and the usual joint CLT for estimating equations. These results are external to the paper and do not reduce to any quantity defined by the paper itself. No self-citation is load-bearing for the central claims, and the derivation does not rename a fitted quantity as a prediction or smuggle an ansatz via prior work. The paper is therefore self-contained against external benchmarks.
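For reference, the standard GMM sandwich formula mentioned above can be written as follows. This is the textbook result from Hansen (1982) in generic notation; the symbols g, G, W and \Omega are chosen for this note and need not match the paper's.

```latex
% Generic GMM notation (this note's symbols, not necessarily the paper's).
\[
\hat\theta = \arg\min_{\theta}\ \bar g_n(\theta)^{\top} W\,\bar g_n(\theta),
\qquad
\bar g_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} g(Z_i;\theta),
\]
\[
\sqrt{n}\,(\hat\theta-\theta_0)\ \xrightarrow{d}\
N\!\Big(0,\ (G^{\top}WG)^{-1}G^{\top}W\,\Omega\,WG\,(G^{\top}WG)^{-1}\Big),
\]
\[
G = E\Big[\frac{\partial g(Z;\theta_0)}{\partial\theta^{\top}}\Big],
\qquad
\Omega = \lim_{n\to\infty}\operatorname{Var}\big\{\sqrt{n}\,\bar g_n(\theta_0)\big\}.
\]
\[
W=\Omega^{-1}\ \Longrightarrow\
(G^{\top}\Omega^{-1}G)^{-1}\ \preceq\
(G^{\top}WG)^{-1}G^{\top}W\,\Omega\,WG\,(G^{\top}WG)^{-1}
\quad\text{for every admissible } W .
\]
```

Here \preceq denotes the Loewner order, meaning the difference of the two matrices is positive semidefinite. When the stacked moment vector mixes subsample and whole-data averages, \Omega has to be the joint limiting covariance, including any off-diagonal block between the two parts.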

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method relies on standard large-sample regularity conditions for MLE and GMM; no new free parameters or invented entities are introduced in the abstract.

axioms (1)

[standard math] Standard regularity conditions ensuring asymptotic normality of MLE and GMM estimators
Invoked to obtain the asymptotic distribution and variance comparison of the MAS estimator.

pith-pipeline@v0.9.0 · 5725 in / 1035 out tokens · 20599 ms · 2026-05-24T06:55:02.046544+00:00 · methodology

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

[3] Ai, M., Yu, J., Zhang, H., and Wang, H. (2021a), Optimal Subsampling Algorithms for Big Data Regressions, Statistica Sinica, 31, 749–772
[4] --- (2021b), Optimal subsampling for large-scale quantile regression, Journal of Complexity, 62, 101512
[5] Drineas, P., Mahoney, M. W., and Muthukrishnan, S. (2006), Sampling algorithms for l_2 regression and applications, Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, 1127–1136
[6] Fan, Y., Liu, Y., Liu, Y., and Qin, J. (2022), Nearly optimal capture-recapture sampling and empirical likelihood weighting estimation for M-estimation with big data, arXiv
[7] Fithian, W. and Hastie, T. (2014), Local case-control sampling: Efficient subsampling in imbalanced data sets, The Annals of Statistics, 42, 1693–1724
[8] Han, L., Tan, K. M., Yang, T., and Zhang, T. (2020), Local uncertainty sampling for large-scale multiclass logistic regression, The Annals of Statistics, 48, 1770–1788
[9] Hansen, L. P. (1982), Large Sample Properties of Generalized Method of Moments Estimators, Econometrica, 50, 1029–1054
[10] Hoffman, A. J. and Wielandt, H. W. (1953), The variation of the spectrum of a normal matrix, Duke Mathematical Journal, 20, 37–39
[11] Kushilevitz, E. and Nisan, N. (1996), Communication Complexity, Cambridge University Press
[12] Newey, W. K. (1994), The asymptotic variance of semiparametric estimators, Econometrica: Journal of the Econometric Society, 1349–1382
[13] R Core Team (2016), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org
[14] van der Vaart, A. W. (2000), Asymptotic Statistics, Cambridge University Press, New York
[15] Wang, H. (2019), More Efficient Estimation for Logistic Regression with Optimal Subsamples, Journal of Machine Learning Research, 20, 1–59
[16] Wang, H. and Kim, J. K. (2022), Maximum sampled conditional likelihood for informative subsampling, Journal of Machine Learning Research, 23, 1–50
[17] Wang, H. and Ma, Y. (2021), Optimal subsampling for quantile regression in big data, Biometrika, 108, 1–14
[18] Wang, H., Yang, M., and Stufken, J. (2019), Information-Based Optimal Subdata Selection for Big Data Linear Regression, Journal of the American Statistical Association, 114, 393–405
[19] Wang, H., Zhu, R., and Ma, P. (2018), Optimal Subsampling for Large Sample Logistic Regression, Journal of the American Statistical Association, 113, 829–844
[20] Wu, C.-F. (1980), On Some Ordering Properties of the Generalized Inverses of Nonnegative Definite Matrices, Linear Algebra and its Applications, 32, 49–60
[21] Yao, Y. and Wang, H. (2019), Optimal subsampling for softmax regression, Statistical Papers, 60, 585–599
[22] Yu, J., Wang, H., Ai, M., and Zhang, H. (2022), Optimal Distributed Subsampling for Maximum Quasi-Likelihood Estimators With Massive Data, Journal of the American Statistical Association, 117, 1–12
