arxiv: 2502.01577 · v1 · submitted 2025-02-03 · 📊 stat.CO

plmmr: an R package to fit penalized linear mixed models for genome-wide association data with complex correlation structure

Pith reviewed 2026-05-23 04:21 UTC · model grok-4.3

classification 📊 stat.CO
keywords: plmmr · penalized linear mixed models · GWAS · correlation structure · best linear unbiased predictor · memory-mapping · R package · genome-wide association

The pith

plmmr fits penalized linear mixed models to GWAS data by estimating correlations among observations to improve BLUP predictions while using memory-mapping for data larger than RAM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents plmmr, an R package for fitting penalized linear mixed models to high-dimensional genome-wide association data. It estimates the correlation structure among observations directly from the data and incorporates those estimates into the best linear unbiased predictor to refine predictions. The package relies on memory-mapping and file-backing so that genome-scale matrices can be analyzed on ordinary computers even when they exceed available RAM. A reader would care because correlations frequently confound regression in genetic studies, and the tool makes penalized mixed-model analysis practical without specialized hardware. The manuscript describes the underlying methods, workflow, and two real-data examples to illustrate its use.

Core claim

plmmr implements penalized linear mixed models that estimate the correlation among observations in high-dimensional data and feed those estimates into the best linear unbiased predictor to improve prediction. Memory-mapping lets genome-scale data be analyzed on ordinary machines even when the data exceed RAM. The paper supports both points by describing the package's methods, workflow, and file-backing approach and by working through examples from real GWAS data.

What carries the argument

Penalized linear mixed model that estimates the correlation structure and applies the best linear unbiased predictor, combined with memory-mapping for file-backed storage of large genomic matrices.
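
The mechanism named above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the package's implementation or API: the relatedness estimate `K = X Xᵀ / p`, the variance share `eta` (treated here as a known input in [0, 1) rather than estimated), the plain lasso penalty, the omission of an intercept (data are assumed centered), and all function names are choices made for this sketch.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue  # skip columns that carry no information
            rho = X[:, j] @ resid / n + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            resid -= X[:, j] * (new - beta[j])
            beta[j] = new
    return beta


def fit_plmm_sketch(X, y, eta, lam):
    """Rotate the data to decorrelate observations, then fit a lasso.

    Assumes Var(y) is proportional to eta * K + (1 - eta) * I with
    K = X X^T / p (an assumption of this sketch) and 0 <= eta < 1.
    """
    n, p = X.shape
    K = X @ X.T / p                       # estimated correlation among observations
    s, U = np.linalg.eigh(K)              # K = U diag(s) U^T
    s = np.clip(s, 0.0, None)             # guard against tiny negative eigenvalues
    w = 1.0 / np.sqrt(eta * s + (1.0 - eta))
    X_rot = w[:, None] * (U.T @ X)        # rotated, rescaled design
    y_rot = w * (U.T @ y)                 # rotated, rescaled outcome
    return lasso_cd(X_rot, y_rot, lam), K


def predict_blup(X_new, X, y, beta, eta):
    """Fixed-effect prediction plus a BLUP-style correction from training residuals."""
    n, p = X.shape
    sigma_11 = eta * (X @ X.T / p) + (1.0 - eta) * np.eye(n)
    sigma_21 = eta * (X_new @ X.T / p)
    return X_new @ beta + sigma_21 @ np.linalg.solve(sigma_11, y - X @ beta)
```

With `eta = 0` the rotation is orthogonal and the correction term vanishes, so the sketch reduces to an ordinary lasso prediction; larger `eta` gives the estimated correlation more influence on both the fit and the prediction.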

If this is right

  • Genome-scale datasets can be analyzed on standard laptops or desktops without requiring data to fit entirely in RAM.
  • Prediction in GWAS settings improves when the estimated correlation among observations is incorporated via the best linear unbiased predictor.
  • Users gain a complete workflow in R for fitting these models with file-backing and memory-mapping.
  • The approach handles complex correlation structures that arise in real genome-wide association studies.
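
The memory-mapping idea behind the first bullet can be illustrated with `numpy.memmap`, used here only as a stand-in for the package's own file-backing layer. The sketch stores a matrix on disk and accumulates `K = X Xᵀ / p` one column block at a time, so only one block needs to be in RAM at once. The function names, the `float64` storage type, and the block size are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np


def write_filebacked(path, X):
    """Write X to a file-backed, column-major array and return its shape."""
    mm = np.memmap(path, dtype="float64", mode="w+", shape=X.shape, order="F")
    mm[:] = X
    mm.flush()
    del mm  # drop the mapping so the file can be reopened read-only
    return X.shape


def relatedness_from_file(path, shape, block_cols=1000):
    """Accumulate K = X X^T / p by streaming column blocks from disk."""
    n, p = shape
    X = np.memmap(path, dtype="float64", mode="r", shape=shape, order="F")
    K = np.zeros((n, n))
    for start in range(0, p, block_cols):
        # Copy one block of columns into RAM; the rest stays on disk.
        block = np.array(X[:, start:start + block_cols])
        K += block @ block.T
    del X
    return K / p


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    demo = rng.standard_normal((20, 57))
    with tempfile.TemporaryDirectory() as tmp:
        file_path = os.path.join(tmp, "X.bin")
        dims = write_filebacked(file_path, demo)
        K = relatedness_from_file(file_path, dims, block_cols=10)
        print(K.shape)
```

Column-major storage is chosen so that each block of columns is contiguous on disk. The `n × n` matrix `K` still has to fit in memory, which is usually the easier constraint in GWAS settings where the number of markers far exceeds the number of samples.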

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same memory-mapping strategy could be applied to other high-dimensional regression problems outside genetics where correlation among samples matters.
  • Direct comparisons of out-of-sample prediction error between plmmr and ordinary penalized regression on held-out GWAS cohorts would quantify the practical gain from the correlation adjustment.
  • The package's file-backing layer might integrate with existing genomic data pipelines to reduce preprocessing steps for very large cohorts.

Load-bearing premise

The correlation structure estimated from the data is accurate and stable enough that feeding it into the best linear unbiased predictor produces meaningful gains in the penalized setting.

What would settle it

A simulation or real GWAS analysis in which predictions from the plmmr model show no improvement or perform worse than a standard penalized regression that ignores the estimated correlations.

Figures

Figures reproduced from arXiv: 2502.01577 by Anna C. Reisetter, Oscar A. Rysavy, Patrick J. Breheny, Tabitha K. Peter, Yujing Lu.

Figure 1: Workflow for plmmr. Steps shown with dotted lines are optional; steps …
Figure 2: Total pipeline time
Figure 3: Time spent in each stage of pipeline
Section 3.2, Orofacial clefting GWAS: To illustrate plmmr at work with more complex correlation structures, we used the plmmr pipeline to analyze data from the Pittsburgh Orofacial Cleft (POFC) study [Marazita and Weinberg, 2024] as our second example. The POFC study was a global, family-based GWAS in which the phenotype of focus was orofacial cleft (e.g., cleft palate). The GWA…
Figure 4: Plot of coefficient paths, POFC data
Figure 5: Plot of cross-validation error, POFC data
Original abstract

Correlation among the observations in high-dimensional regression modeling can be a major source of confounding. We present a new open-source package, plmmr, to implement penalized linear mixed models in R. This R package estimates correlation among observations in high-dimensional data and uses those estimates to improve prediction with the best linear unbiased predictor. The package uses memory-mapping so that genome-scale data can be analyzed on ordinary machines even if the size of data exceeds RAM. We present here the methods, workflow, and file-backing approach upon which plmmr is built, and we demonstrate its computational capabilities with two examples from real GWAS data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript presents plmmr, an R package for fitting penalized linear mixed models to genome-wide association data. It estimates correlation structures in high-dimensional data and incorporates the best linear unbiased predictor (BLUP) to enhance prediction accuracy within a penalized regression framework. The package utilizes memory-mapping to enable analysis of genome-scale datasets on standard hardware without requiring data to fit in RAM. The paper details the underlying methods, workflow, and file-backing approach, and illustrates the package's capabilities through examples on real GWAS datasets.

Significance. This work provides a practical software tool that addresses challenges in statistical genetics involving correlated observations. By combining penalized LMMs with BLUP and scalable data handling, plmmr could facilitate more robust analyses in GWAS where ignoring correlations might lead to biased results. The open-source implementation and demonstration on real data are strengths that support its potential utility in the field.

minor comments (2)
  1. [Abstract] The claim that the package 'uses those estimates to improve prediction with the best linear unbiased predictor' is central, but the provided text gives no quantitative metrics (e.g., prediction error reduction or cross-validation scores) from the two real-data examples. Adding a concise summary of these results would strengthen the abstract without altering its scope.
  2. The workflow description would benefit from an explicit algorithmic outline (e.g., steps for correlation estimation followed by penalized objective with BLUP adjustment) to make the integration of standard methods fully transparent for users.
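
For concreteness, the kind of outline requested in comment 2 might read as follows. This is a sketch of one standard penalized-LMM formulation, not the paper's own notation: the relatedness estimate $K = XX^\top/p$, the variance share $\eta$, and the generic penalty $P_\lambda$ are assumptions made for illustration.

```latex
\begin{enumerate}
  \item Model. $y = X\beta + u + \varepsilon$, with
        $u \sim N(0, \sigma_s^2 K)$ and
        $\varepsilon \sim N(0, \sigma_\varepsilon^2 I)$,
        where $K = XX^\top / p$ is estimated from the data.
  \item Correlation structure. With
        $\eta = \sigma_s^2 / (\sigma_s^2 + \sigma_\varepsilon^2)$,
        \[
          \operatorname{Var}(y) \propto \Sigma(\eta) = \eta K + (1 - \eta) I .
        \]
  \item Rotation. Using the eigendecomposition $K = U S U^\top$, set
        \[
          \tilde{y} = \bigl(\eta S + (1 - \eta) I\bigr)^{-1/2} U^\top y,
          \qquad
          \tilde{X} = \bigl(\eta S + (1 - \eta) I\bigr)^{-1/2} U^\top X .
        \]
  \item Penalized fit.
        \[
          \hat{\beta} = \arg\min_{\beta}\;
          \frac{1}{2n} \lVert \tilde{y} - \tilde{X}\beta \rVert_2^2
          + P_\lambda(\beta).
        \]
  \item Prediction with the BLUP. For new observations with covariance
        blocks $\Sigma_{21}$ (new versus training) and $\Sigma_{11}$ (training),
        \[
          \hat{y}_{\text{new}} = X_{\text{new}} \hat{\beta}
          + \Sigma_{21} \Sigma_{11}^{-1} \bigl(y - X\hat{\beta}\bigr).
        \]
\end{enumerate}
```

When $\eta = 0$ the rotation is orthogonal and $\Sigma_{21} = 0$, so steps 3 to 5 collapse to ordinary penalized regression, which is the baseline any claimed gain would be measured against.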

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, recognition of the package's utility for correlated GWAS data, and recommendation of minor revision. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; software implementation of established methods

The manuscript describes the plmmr R package for penalized linear mixed models on GWAS data, focusing on correlation estimation, BLUP-based prediction, and file-backed matrix handling for scale. No derivation chain is presented that reduces any claimed prediction or result to its own inputs by construction. The workflow follows standard LMM theory without self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations for uniqueness. The paper is self-contained as an implementation description with runtime examples on real data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The package rests on standard assumptions of linear mixed models and penalized regression; no new free parameters, axioms, or invented entities are introduced beyond the software implementation itself.

axioms (1)
  • Domain assumption: linear mixed model assumptions hold for the correlation structure in GWAS data.
    Invoked in the description of estimating correlations to improve BLUP predictions.

pith-pipeline@v0.9.0 · 5654 in / 1133 out tokens · 38865 ms · 2026-05-23T04:21:16.055430+00:00 · methodology

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

  1. Kuruvilla Joseph Abraham and Clara Diaz. Identifying large sets of unrelated individuals and unrelated markers. Source Code for Biology and Medicine, 9:1-8, 2014.
  2. Douglas Bates, Martin Maechler, and Mikael Jagan. Matrix: Sparse and Dense Matrix Classes and Methods, 2024. URL https://CRAN.R-project.org/package=Matrix. R package version 1.7-0.
  3. Terri H Beaty, Mary L Marazita, and Elizabeth J Leslie. Genetic factors influencing risk to orofacial clefts: today’s challenges and tomorrow’s opportunities. F1000Research, 5, 2016.
  4. Beben Benyamin, Peter M Visscher, and Allan F McRae. Family-based genome-wide association studies. Pharmacogenomics, 10(2):181-190, 2009.
  5. Frederic Bertrand, Michael J. Kane, John Emerson, and Stephen Weston. 'BLAS' and 'LAPACK' Routines for Native R Matrices and 'big.matrix' Objects, 2024. URL https://fbertran.github.io/bigalgebra/. R package version 1.1.2.
  6. Sahir R Bhatnagar, Yi Yang, Tianyuan Lu, Erwin Schurr, JC Loredo-Osti, Marie Forest, Karim Oualkacha, and Celia MT Greenwood. Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models. PLoS Genetics, 16(5):e1008766, 2020.
  7. Domagoj Ćevid, Peter Bühlmann, and Nicolai Meinshausen. Spectral deconfounding via perturbed sparse linear models. Journal of Machine Learning Research, 21(232):1-41, 2020.
  8. Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348-1360, 2001.
  9. Zvi Galil. Efficient algorithms for finding maximum matching in graphs. ACM Computing Surveys (CSUR), 18(1):23-38, 1986.
  10. E. Gorstein, R. Aghdam, and C. Solís-Lemus. HighDimMixedModels.jl: Robust High Dimensional Mixed Models across Omics Data. In preparation, 2024.
  11. Trevor Hastie, Robert Tibshirani, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
  12. Hillary M. Heiling, Naim U. Rashid, Quefeng Li, and Joseph G. Ibrahim. glmmpen: High dimensional penalized generalized linear mixed models. The R Journal, 15:106-128, 2024. ISSN 2073-4859. doi:10.32614/RJ-2023-086. URL https://doi.org/10.32614/RJ-2023-086.
  13. Jinzhu Jia and Karl Rohe. Preconditioning the lasso for sign consistency. Electronic Journal of Statistics, 9(1):1150-1172, 2015. doi:10.1214/15-EJS1029.
  14. Longda Jiang, Zhili Zheng, Ting Qi, Kathryn E Kemper, Naomi R Wray, Peter M Visscher, and Jian Yang. A resource-efficient tool for mixed model association analysis of large-scale data. Nature Genetics, 51(12):1749-1755, 2019.
  15. Michael J. Kane, John W. Emerson, and Stephen Weston. Scalable strategies for computing with massive data. Journal of Statistical Software, 55(14):1-19, 2013. URL https://www.jstatsoft.org/article/view/v055i14.
  16. J. T. Leek and J. D. Storey. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics, 3(9):e161, 2007. doi:10.1371/journal.pgen.0030161.
  17. Jeffrey T. Leek, Robert B. Scharpf, Héctor Corrada Bravo, David Simcha, Benjamin Langmead, W. Evan Johnson, Donald Geman, Keith Baggerly, and Rafael A. Irizarry. Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics, 11(10):733-739, October 2010. ISSN 1471-0056. doi:10.1038/nrg2825.
  18. E.J. Leslie, D.C. Koboldt, C.J. Kang, L. Ma, J.T. Hecht, G.L. Wehby, K. Christensen, A.E. Czeizel, F.W.-B. Deleyiannis, R.S. Fulton, R.K. Wilson, T.H. Beaty, B.C. Schutte, J.C. Murray, and M.L. Marazita. IRF6 mutation screening in non-syndromic orofacial clefting: analysis of 1521 families. Clinical Genetics, 90(1):28-34, October 2015a. doi:10.1111/cge.12675.
  19. Elizabeth J Leslie and Mary L Marazita. Genetics of cleft lip and cleft palate. American Journal of Medical Genetics Part C: Seminars in Medical Genetics, 163(4):246-258, 2013. doi:10.1002/ajmg.c.31381.
  20. Elizabeth J. Leslie, Margaret A. Taub, Huan Liu, Karyn Meltz Steinberg, Daniel C. Koboldt, Qunyuan Zhang, Jenna C. Carlson, Jacqueline B. Hetmanski, Hang Wang, David E. Larson, Robert S. Fulton, Youssef A. Kousa, Walid D. Fakhouri, Ali Naji, Ingo Ruczinski, Ferdouse Begum, Margaret M. Parker, Tamara Busch, Jennifer Standley, Jennifer Rigdon, Jacqueline T....
  21. Xihao Li, Zilin Li, Hufeng Zhou, Sheila M Gaynor, Yaowu Liu, Han Chen, Ryan Sun, Rounak Dey, Donna K Arnett, Stella Aslibekyan, et al. Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nature Genetics, 52(9):969-983, 2020.
  22. Po-Ru Loh, George Tucker, Brendan K Bulik-Sullivan, Bjarni J Vilhjalmsson, Hilary K Finucane, Rany M Salem, Daniel I Chasman, Paul M Ridker, Benjamin M Neale, Bonnie Berger, et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nature Genetics, 47(3):284-290, 2015.
  23. Mary Marazita and Seth Weinberg. Pittsburgh orofacial cleft studies, September 2024. URL https://www.dental.pitt.edu/research/ccdg/participate-research/pittsburgh-orofacial-cleft-studies. Center for Craniofacial and Dental Genetics, University of Pittsburgh. Website.
  24. Joelle Mbatchou, Leland Barnard, Joshua Backman, Anthony Marcketta, Jack A Kosmicki, Andrey Ziyatdinov, Christian Benner, Colm O’Dushlaine, Mathew Barber, Boris Boutkov, et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nature Genetics, 53(7):1097-1103, 2021.
  25. Melinda C Mills and Charles Rahal. The GWAS diversity monitor tracks diversity by disease in real time. Nature Genetics, 52(3):242-243, 2020.
  26. Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick, and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904-909, 2006.
  27. Florian Privé, Hugues Aschard, Andrey Ziyatdinov, and Michael GB Blum. Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr. Bioinformatics, 34(16):2781-2787, 2018.
  28. Shaun Purcell, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel AR Ferreira, David Bender, Julian Maller, Pamela Sklar, Paul IW De Bakker, Mark J Daly, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics, 81(3):559-575, 2007.
  29. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2024. URL https://www.R-project.org/.
  30. Assaf Rabinowicz and Saharon Rosset. Cross-validation for correlated data. Journal of the American Statistical Association, 117(538):718-731, 2022.
  31. Barbara Rakitsch, Christoph Lippert, Oliver Stegle, and Karsten Borgwardt. A lasso multi-marker mixed model for association mapping with population structure correction. Bioinformatics, 29(2):206-214, 2013.
  32. Muredach P Reilly, Mingyao Li, Jing He, Jane F Ferguson, Ioannis M Stylianou, Nehal N Mehta, Mary Susan Burnett, Joseph M Devaney, Christopher W Knouff, John R Thompson, et al. Identification of ADAMTS7 as a novel locus for coronary atherosclerosis and association of ABO with myocardial infarction in the presence of coronary atherosclerosis: two genome-wi...
  33. G. K. Robinson. That BLUP is a good thing: The estimation of random effects. Statistical Science, 6(1):15-32, 1991.
  34. Julien St-Pierre, Karim Oualkacha, and Sahir Rai Bhatnagar. Efficient penalized generalized linear mixed models for variable selection and genetic risk prediction in high-dimensional data. Bioinformatics, 39(2):btad063, 2023.
  35. Jeffrey Staples, Deborah A Nickerson, and Jennifer E Below. Utilizing graph theory to select the largest set of unrelated individuals for genetic analysis. Genetic Epidemiology, 37(2):136-141, 2013.
  36. Stuart C Thomas. The estimation of genetic relationships using molecular markers and their efficiency in estimating heritability in natural populations. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1459):1457-1467, 2005.
  37. Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267-288, 1996.
  38. Ismail H Toroslu and Yilmaz Arslanoglu. Genetic algorithm for the personnel assignment problem with multiple objectives. Information Sciences, 177(3):787-803, 2007.
  39. Andrew J Wathen. Preconditioning. Acta Numerica, 24:329-376, 2015.
  40. Yaohui Zeng and Patrick Breheny. The biglasso package: A memory- and computation-efficient solver for lasso model fitting with big data in R. R Journal, 12(2):6-19, 2021. URL https://doi.org/10.32614/RJ-2021-001.
  41. C. H. Zhang. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38(2):894-942, 2010.
  42. Wei Zhou, Jonas B Nielsen, Lars G Fritsche, Rounak Dey, Maiken E Gabrielsen, Brooke N Wolford, Jonathon LeFaive, Peter VandeHaar, Sarah A Gagliano, Aliya Gifford, et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nature Genetics, 50(9):1335-1341, 2018.