plmmr: an R package to fit penalized linear mixed models for genome-wide association data with complex correlation structure
Pith reviewed 2026-05-23 04:21 UTC · model grok-4.3
The pith
plmmr fits penalized linear mixed models to GWAS data by estimating correlations among observations to improve BLUP predictions while using memory-mapping for data larger than RAM.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
plmmr implements penalized linear mixed models that estimate the correlation among observations in high-dimensional data and use those estimates to improve prediction with the best linear unbiased predictor (BLUP). Memory-mapping allows genome-scale data to be analyzed on ordinary machines even when the data size exceeds RAM. The paper supports these claims by describing the package's methods, workflow, and file-backing approach, and by presenting examples from real GWAS data.
What carries the argument
Penalized linear mixed model that estimates the correlation structure and applies the best linear unbiased predictor, combined with memory-mapping for file-backed storage of large genomic matrices.
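The memory-mapping idea can be illustrated with a minimal Python sketch. This is not plmmr's implementation (the package is written in R, and its file-backing layer is not reproduced here). The file layout, the 8-byte little-endian doubles, and the helper names `write_matrix` and `read_column` are all assumptions made for illustration. The sketch stores a small matrix column by column in a file, then reads one column through `mmap`, so the whole file never has to be loaded at once.

```python
import mmap
import os
import struct
import tempfile

DOUBLE = struct.calcsize("<d")  # 8 bytes per stored value


def write_matrix(path, columns):
    """Store a matrix column by column as raw little-endian doubles."""
    with open(path, "wb") as fh:
        for col in columns:
            fh.write(struct.pack(f"<{len(col)}d", *col))


def read_column(path, n_rows, j):
    """Return column j by mapping the file and slicing only that column."""
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = j * n_rows * DOUBLE
            raw = mm[start:start + n_rows * DOUBLE]  # bytes copy of one column
    return list(struct.unpack(f"<{n_rows}d", raw))


# Toy "genotype" matrix (hypothetical data): 3 samples by 4 markers.
columns = [[0.0, 1.0, 2.0], [1.0, 1.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 1.0]]
path = os.path.join(tempfile.mkdtemp(), "toy_matrix.bin")
write_matrix(path, columns)
third_column = read_column(path, n_rows=3, j=2)
```

Column-major storage is chosen here because marker-wise access (one column at a time) then touches a single contiguous byte range of the file.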
If this is right
- Genome-scale datasets can be analyzed on standard laptops or desktops without requiring data to fit entirely in RAM.
- Prediction in GWAS settings improves when the estimated correlation among observations is incorporated via the best linear unbiased predictor.
- Users gain a complete workflow in R for fitting these models with file-backing and memory-mapping.
- The approach handles complex correlation structures that arise in real genome-wide association studies.
Where Pith is reading between the lines
- The same memory-mapping strategy could be applied to other high-dimensional regression problems outside genetics where correlation among samples matters.
- Direct comparisons of out-of-sample prediction error between plmmr and ordinary penalized regression on held-out GWAS cohorts would quantify the practical gain from the correlation adjustment.
- The package's file-backing layer might integrate with existing genomic data pipelines to reduce preprocessing steps for very large cohorts.
Load-bearing premise
The correlation structure estimated from the data is accurate and stable enough that feeding it into the best linear unbiased predictor produces meaningful gains in the penalized setting.
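To show why this premise matters, here is a minimal Python sketch of the textbook BLUP adjustment, under the assumption that the covariance is known rather than estimated. It is not plmmr's code. The prediction for a new observation is the fixed-effect part plus a correction, Σ_new,train Σ_train⁻¹ (y − Xβ), so any error in the covariance blocks flows directly into the prediction. The helper names `solve` and `blup_predict` and all numbers are hypothetical.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(m[r][k]))
        m[k], m[pivot] = m[pivot], m[k]
        for r in range(k + 1, n):
            f = m[r][k] / m[k][k]
            for c in range(k, n + 1):
                m[r][c] -= f * m[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        s = sum(m[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (m[k][n] - s) / m[k][k]
    return x


def blup_predict(x_new, beta, sigma_new_train, sigma_train, x_train, y_train):
    """Fixed-effect prediction plus the correlation-based adjustment."""
    resid = [y - sum(xj * bj for xj, bj in zip(row, beta))
             for row, y in zip(x_train, y_train)]
    weights = solve(sigma_train, resid)  # Sigma_train^{-1} (y - X beta)
    fixed = sum(xj * bj for xj, bj in zip(x_new, beta))
    adjust = sum(s * w for s, w in zip(sigma_new_train, weights))
    return fixed + adjust


# Two training observations; beta is assumed to be already estimated.
x_train = [[1.0], [2.0]]
y_train = [2.0, 2.0]
beta = [1.0]
sigma_train = [[1.0, 0.0], [0.0, 1.0]]
# New observation assumed to have covariance 0.5 with the first training point.
with_corr = blup_predict([3.0], beta, [0.5, 0.0], sigma_train, x_train, y_train)
# Same observation with the covariance ignored (ordinary fixed-effect prediction).
no_corr = blup_predict([3.0], beta, [0.0, 0.0], sigma_train, x_train, y_train)
```

In this toy case the two predictions differ only by the adjustment term, which is exactly the quantity whose usefulness depends on how well the correlation structure is estimated.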
What would settle it
A simulation or real GWAS analysis in which predictions from the plmmr model show no improvement or perform worse than a standard penalized regression that ignores the estimated correlations.
Original abstract
Correlation among the observations in high-dimensional regression modeling can be a major source of confounding. We present a new open-source package, plmmr, to implement penalized linear mixed models in R. This R package estimates correlation among observations in high-dimensional data and uses those estimates to improve prediction with the best linear unbiased predictor. The package uses memory-mapping so that genome-scale data can be analyzed on ordinary machines even if the size of data exceeds RAM. We present here the methods, workflow, and file-backing approach upon which plmmr is built, and we demonstrate its computational capabilities with two examples from real GWAS data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents plmmr, an R package for fitting penalized linear mixed models to genome-wide association data. It estimates correlation structures in high-dimensional data and incorporates the best linear unbiased predictor (BLUP) to enhance prediction accuracy within a penalized regression framework. The package utilizes memory-mapping to enable analysis of genome-scale datasets on standard hardware without requiring data to fit in RAM. The paper details the underlying methods, workflow, and file-backing approach, and illustrates the package's capabilities through examples on real GWAS datasets.
Significance. This work provides a practical software tool that addresses challenges in statistical genetics involving correlated observations. By combining penalized LMMs with BLUP and scalable data handling, plmmr could facilitate more robust analyses in GWAS where ignoring correlations might lead to biased results. The open-source implementation and demonstration on real data are strengths that support its potential utility in the field.
minor comments (2)
- [Abstract] The claim that the package 'uses those estimates to improve prediction with the best linear unbiased predictor' is central, but the provided text gives no quantitative metrics (e.g., prediction error reduction or cross-validation scores) from the two real-data examples. Adding a concise summary of these results would strengthen the abstract without altering its scope.
- The workflow description would benefit from an explicit algorithmic outline (e.g., steps for correlation estimation followed by penalized objective with BLUP adjustment) to make the integration of standard methods fully transparent for users.
Simulated Author's Rebuttal
We thank the referee for their positive summary, recognition of the package's utility for correlated GWAS data, and recommendation of minor revision. No major comments were raised in the report.
Circularity Check
No significant circularity; software implementation of established methods
full rationale
The manuscript describes the plmmr R package for penalized linear mixed models on GWAS data, focusing on correlation estimation, BLUP-based prediction, and file-backed matrix handling for scale. No derivation chain is presented that reduces any claimed prediction or result to its own inputs by construction. The workflow follows standard LMM theory without self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations for uniqueness. The paper is self-contained as an implementation description with runtime examples on real data.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Linear mixed model assumptions hold for the correlation structure in GWAS data.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: preconditioning ... Σ^{-1/2}y ∼ N((Σ^{-1/2}X)β, I) ... penalized regression approaches such as lasso ... may be applied
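The elided steps in the quoted passage can be spelled out as a short derivation. This is a sketch that assumes the starting model y ∼ N(Xβ, Σ) with Σ symmetric positive definite. The tilde notation, the 1/(2n) scaling, and the tuning parameter λ are notational assumptions, not taken from the paper.

```latex
% Assumed starting model: correlated observations with covariance \Sigma.
y \sim N(X\beta, \Sigma)

% Precondition (whiten) by multiplying through by \Sigma^{-1/2}:
\tilde{y} = \Sigma^{-1/2} y, \qquad \tilde{X} = \Sigma^{-1/2} X

% The transformed response has identity covariance:
\operatorname{Cov}(\tilde{y}) = \Sigma^{-1/2} \, \Sigma \, \Sigma^{-1/2} = I,
\qquad \tilde{y} \sim N(\tilde{X}\beta, I)

% A lasso fit on the transformed data then takes the usual form:
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2n} \lVert \tilde{y} - \tilde{X}\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1
```

Because the transformed observations are uncorrelated, the standard penalized regression machinery applies without further modification.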
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.