
arxiv: 2509.10736 · v2 · submitted 2025-09-12 · 📊 stat.AP

Adaptive Bayesian computation for efficient biobank-scale genomic inference

Pith reviewed 2026-05-18 17:41 UTC · model grok-4.3

classification 📊 stat.AP
keywords adaptive variational inference · biobank-scale genomics · pQTL mapping · hierarchical Bayesian models · coordinate ascent · multi-trait analysis · computational efficiency · UK Biobank

The pith

Adaptive focus strategy in variational inference cuts runtime by up to half for biobank pQTL mapping

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an adaptive focus strategy inside block coordinate ascent variational inference that updates only the parameter subsets judged relevant from current estimates. This targets the computational bottleneck in fitting hierarchical Bayesian models to biobank data that jointly analyze many traits or units. A sympathetic reader would care because full updates become prohibitive at genome-wide scale with thousands of traits and large samples. The approach is shown on a joint model of hierarchically linked regressions for protein QTL mapping, delivering up to 50 percent runtime reduction while preserving statistical performance in both simulated and real UK Biobank proteomic data.

Core claim

We propose an adaptive focus (AF) strategy within a block coordinate ascent variational inference (CAVI) framework that selectively updates subsets of parameters at each iteration, corresponding to units deemed relevant based on current estimates. We illustrate this approach in protein quantitative trait locus (pQTL) mapping using a joint model of hierarchically linked regressions with shared parameters across traits. In both simulated data and real proteomic data from the UK Biobank, AF-CAVI achieves up to a 50% reduction in runtime while maintaining statistical performance. We also provide a genome-wide pipeline for multi-trait pQTL mapping across thousands of traits.

What carries the argument

Adaptive focus (AF) strategy within block coordinate ascent variational inference (CAVI), which selects and updates only relevant parameter subsets based on current variational estimates to concentrate effort on biologically important units.
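
To make the selective-update idea concrete, here is a minimal toy sketch in Python. It is not the paper's algorithm or model: it assumes a single predictor, a spike-and-slab effect per trait with noise variance fixed at 1, and a shared inclusion rate `omega` updated by a simple point estimate. The names `af_cavi`, `update_unit`, `threshold`, and `full_sweep_every` are hypothetical, and the periodic full sweep is an illustrative safeguard rather than something the source describes.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def update_unit(xty, xtx, omega, slab_var):
    """Local spike-and-slab update for one trait (noise variance fixed at 1)."""
    v = 1.0 / (xtx + 1.0 / slab_var)    # variational variance of the slab effect
    mu = v * xty                        # variational mean of the slab effect
    logit = (math.log(omega / (1.0 - omega))
             + 0.5 * math.log(v / slab_var)
             + mu * mu / (2.0 * v))
    return sigmoid(logit), mu


def af_cavi(xty, xtx, threshold=0.05, full_sweep_every=5, n_iter=20, slab_var=1.0):
    """Toy adaptive-focus loop: returns inclusion probs, means, update count."""
    n_traits = len(xty)
    alpha = [0.5] * n_traits
    mu = [0.0] * n_traits
    omega = 0.5                         # shared inclusion rate linking the traits
    n_updates = 0
    for it in range(n_iter):
        full_sweep = (it % full_sweep_every == 0)
        # Focus set: every trait on a full sweep, otherwise only "relevant" ones.
        active = [t for t in range(n_traits) if full_sweep or alpha[t] > threshold]
        for t in active:
            alpha[t], mu[t] = update_unit(xty[t], xtx, omega, slab_var)
        n_updates += len(active)
        # Shared parameter: point update, clipped away from 0 and 1.
        omega = min(max(sum(alpha) / n_traits, 1e-6), 1.0 - 1e-6)
    return alpha, mu, n_updates


# Simulated toy data: one predictor, 50 traits, the first 3 truly associated.
rng = random.Random(0)
n, n_traits = 200, 50
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
xtx = sum(xi * xi for xi in x)
xty = []
for t in range(n_traits):
    beta = 1.0 if t < 3 else 0.0
    y = [beta * xi + rng.gauss(0.0, 1.0) for xi in x]
    xty.append(sum(xi * yi for xi, yi in zip(x, y)))

alpha_af, mu_af, updates_af = af_cavi(xty, xtx)
alpha_full, mu_full, updates_full = af_cavi(xty, xtx, full_sweep_every=1)
print("local updates, adaptive vs full:", updates_af, updates_full)
```

Setting `full_sweep_every=1` recovers a loop that updates every trait at every iteration, which gives a like-for-like baseline for counting local updates. In this toy the saving comes entirely from skipping traits whose inclusion probability has fallen below the threshold.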

If this is right

  • Enables routine joint modeling of thousands of traits at biobank scale by cutting computation time.
  • Preserves statistical performance in pQTL mapping tasks compared with full updates.
  • Supports practical genome-wide pipelines for multi-trait analyses.
  • Applies to other hierarchical Bayesian models where effects concentrate in few units.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar selective updating could accelerate variational methods in other high-dimensional sparse-signal domains such as imaging or single-cell data.
  • The method suggests testing dynamic relevance criteria that evolve during optimization rather than fixing them early.
  • Combining the focus strategy with stochastic or parallel updates could yield further speed gains in even larger datasets.

Load-bearing premise

That parameter subsets identified as relevant from current variational estimates are sufficient to preserve the quality of the joint posterior approximation over the full high-dimensional space without systematic under-updating of important units.

What would settle it

Run both full CAVI and AF-CAVI on the same UK Biobank proteomic dataset with thousands of traits and compare the sets of discovered pQTLs and posterior effect estimates; large discrepancies in detected associations would falsify the claim of maintained performance.
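
The comparison described above could be scripted roughly as follows. This is a sketch under stated assumptions: the inputs are hypothetical dictionaries of posterior probabilities of inclusion (PPI) and effect estimates keyed by (SNP, protein) pairs, the 0.5 cut-off is an illustrative choice rather than a value taken from the paper, and all numbers are made up.

```python
def compare_fits(ppi_full, ppi_af, beta_full, beta_af, threshold=0.5):
    """Compare discoveries and effect estimates from two fits.

    All inputs are dicts keyed by (snp, trait) pairs. The 0.5 threshold
    is an illustrative choice, not a value taken from the paper.
    """
    hits_full = {k for k, p in ppi_full.items() if p > threshold}
    hits_af = {k for k, p in ppi_af.items() if p > threshold}
    union = hits_full | hits_af
    # Jaccard index of the two discovery sets (1.0 if both are empty).
    jaccard = len(hits_full & hits_af) / len(union) if union else 1.0
    max_beta_gap = max(abs(beta_full[k] - beta_af[k]) for k in beta_full)
    return {
        "only_full": sorted(hits_full - hits_af),
        "only_af": sorted(hits_af - hits_full),
        "jaccard": jaccard,
        "max_beta_gap": max_beta_gap,
    }


# Made-up values for three (SNP, protein) pairs.
ppi_full = {("rs1", "P1"): 0.98, ("rs1", "P2"): 0.61, ("rs2", "P1"): 0.03}
ppi_af = {("rs1", "P1"): 0.97, ("rs1", "P2"): 0.40, ("rs2", "P1"): 0.03}
beta_full = {("rs1", "P1"): 0.52, ("rs1", "P2"): 0.11, ("rs2", "P1"): 0.00}
beta_af = {("rs1", "P1"): 0.51, ("rs1", "P2"): 0.07, ("rs2", "P1"): 0.00}

report = compare_fits(ppi_full, ppi_af, beta_full, beta_af)
print(report)
```

In this made-up example one association is found only by the full fit, so the Jaccard index is 0.5. On real data, a low index or a large effect gap would be the kind of discrepancy that counts against the claim.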

Figures

Figures reproduced from arXiv: 2509.10736 by Helene Ruffieux, John Whittaker, Sylvia Richardson, Yiran Li.

Figure 1: Comparison of ROC (left) and PR (right) curves for atlasQTL and univariate testing in …
Figure 2: Relative differences in runtime (local and total) and iterations for the RF-CAVI and series …
Figure 3: Pipeline for genome-wide joint pQTL mapping using the AF-CAVI algorithm for the …
Figure 4: Regression coefficients (a.k.a. BETA, left) and PPI (right) inferred by the vanilla CAVI and AF-CAVI algorithms in each locus. Maximum value is taken for each response in each locus.

… perturbation mechanism in the selection of local factors, which shares similarity with the adaptive scanning MCMC suggested in Richardson, Bottolo, and Rosenthal [24] for sparse Bayesian hierarchical regressions. Such ideas hav…

Original abstract

Motivation: Modern biobanks, with unprecedented sample sizes and phenotypic diversity, have become foundational resources for genomic studies, enabling powerful cross-phenotype and population-scale analyses. As studies grow in complexity, Bayesian hierarchical models offer a principled framework for jointly modeling multiple units such as cells, traits, and experimental conditions, increasing statistical power through information sharing. However, adoption of Bayesian hierarchical models in biobank-scale studies remains limited due to computational inefficiencies, particularly in posterior inference over high-dimensional parameter spaces. Deterministic approximations such as variational inference provide scalable alternatives to Markov Chain Monte Carlo, yet current implementations do not fully exploit the structure of genome-wide multi-unit modeling, especially when biological effects of interest are concentrated in a few units.

Results: We propose an adaptive focus (AF) strategy within a block coordinate ascent variational inference (CAVI) framework that selectively updates subsets of parameters at each iteration, corresponding to units deemed relevant based on current estimates. We illustrate this approach in protein quantitative trait locus (pQTL) mapping using a joint model of hierarchically linked regressions with shared parameters across traits. In both simulated data and real proteomic data from the UK Biobank, AF-CAVI achieves up to a 50% reduction in runtime while maintaining statistical performance. We also provide a genome-wide pipeline for multi-trait pQTL mapping across thousands of traits, demonstrating AF-CAVI as an efficient scheme for large-scale, multi-unit Bayesian analysis in biobanks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces an adaptive focus (AF) strategy within a block coordinate ascent variational inference (CAVI) framework for scalable posterior inference in high-dimensional Bayesian hierarchical models applied to biobank-scale genomic data. The approach selectively updates parameter subsets deemed relevant based on current variational estimates and is illustrated in protein quantitative trait locus (pQTL) mapping via a joint model of hierarchically linked regressions with shared parameters across traits. Empirical evaluations on simulated data and real proteomic data from the UK Biobank report up to 50% runtime reduction while maintaining statistical performance, and a genome-wide pipeline for multi-trait pQTL mapping is presented.

Significance. If the adaptive selection reliably preserves the fidelity of the variational approximation to the joint posterior, the method could meaningfully expand the feasibility of Bayesian hierarchical modeling for multi-unit analyses at biobank scales. The empirical results on both simulated and real UK Biobank data, together with the provided pipeline, support practical utility for large-scale genomic inference.

major comments (2)
  1. [Methods (AF-CAVI algorithm description)] The AF strategy selects parameter blocks for update using thresholds on current variational means or variances. In the hierarchically linked regression model, shared parameters across traits couple the units; an early underestimate for one trait can therefore cause a unit to be skipped even when its marginal contribution to the joint ELBO is non-negligible. Because CAVI updates are coordinate-wise, repeated skipping can leave the variational distribution at a point that is not a stationary point of the full ELBO, violating the usual monotonicity guarantee.
  2. [Results (simulated and real-data experiments)] The paper reports runtime and statistical performance on simulated and UK Biobank data but does not quantify the fraction of units that are permanently excluded or compare final ELBO values between AF-CAVI and full CAVI. This information is needed to assess whether the adaptive approximation preserves quality over the full high-dimensional space.
minor comments (3)
  1. [Abstract] The abstract states that AF-CAVI 'maintains statistical performance' but provides no details on the exact metrics (e.g., power, false discovery rate) or the precise baseline methods used for comparison.
  2. [Results] Sensitivity of results to the choice of relevance threshold is not explored; reporting performance across a range of thresholds would strengthen the robustness claims.
  3. Notation for the hierarchically linked regression model and the shared parameters could be introduced more explicitly with a small illustrative diagram to aid readers.
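
The monotonicity concern in major comment 1 and the ELBO comparison requested in major comment 2 can both be checked from recorded ELBO traces. The sketch below assumes hypothetical lists of ELBO values, one per iteration, with the adaptive run's ELBO evaluated on the full objective. The function names and the numbers are illustrative and do not come from the paper.

```python
def elbo_decreases(trace, tol=1e-8):
    """Return (iteration, drop) pairs where the ELBO fell by more than tol."""
    return [(i, trace[i - 1] - trace[i])
            for i in range(1, len(trace))
            if trace[i - 1] - trace[i] > tol]


def final_elbo_gap(trace_full, trace_af):
    """Final ELBO of the full fit minus that of the adaptive fit."""
    return trace_full[-1] - trace_af[-1]


# Made-up traces; the adaptive run dips once, at iteration 2.
trace_full = [-120.0, -101.0, -100.5, -100.4]
trace_af = [-120.0, -101.5, -101.7, -100.9]

violations = elbo_decreases(trace_af)
gap = final_elbo_gap(trace_full, trace_af)
print(violations, gap)
```

A non-empty `violations` list flags iterations where the usual monotone-increase property failed, and a positive `gap` means the adaptive run ended at a lower ELBO than the full run. Repeating the check across several relevance thresholds would also speak to minor comment 2.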

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation.

Point-by-point responses
  1. Referee: [Methods (AF-CAVI algorithm description)] The AF strategy selects parameter blocks for update using thresholds on current variational means or variances. In the hierarchically linked regression model, shared parameters across traits couple the units; an early underestimate for one trait can therefore cause a unit to be skipped even when its marginal contribution to the joint ELBO is non-negligible. Because CAVI updates are coordinate-wise, repeated skipping can leave the variational distribution at a point that is not a stationary point of the full ELBO, violating the usual monotonicity guarantee.

    Authors: We appreciate the referee's careful analysis of the convergence implications. The adaptive block selection in AF-CAVI is driven by current variational estimates with the goal of concentrating computation on units that contribute meaningfully to the joint posterior. We acknowledge that this dynamic selection means the standard monotonicity proof for fixed-block coordinate ascent does not apply directly, and the final variational distribution may not be a stationary point of the unrestricted ELBO. In the revised manuscript we will expand the Methods section to discuss this point explicitly, describe the safeguards built into our threshold rules, and report empirical checks confirming that the attained ELBO values remain close to those of full CAVI. revision: yes

  2. Referee: [Results (simulated and real-data experiments)] The paper reports runtime and statistical performance on simulated and UK Biobank data but does not quantify the fraction of units that are permanently excluded or compare final ELBO values between AF-CAVI and full CAVI. This information is needed to assess whether the adaptive approximation preserves quality over the full high-dimensional space.

    Authors: We agree that these diagnostics would provide valuable reassurance about approximation quality. In the revised manuscript we will add quantitative summaries of the fraction of units excluded at each iteration and the proportion that remain permanently skipped. We will also include direct ELBO comparisons between AF-CAVI and standard CAVI on the simulated data and on a representative subset of UK Biobank traits for which full CAVI remains computationally tractable. These additions will allow readers to evaluate the fidelity of the adaptive approximation more rigorously. revision: yes
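
The exclusion diagnostics promised in this response could be computed from a log of which units were updated at each iteration. The sketch below is a hypothetical illustration: `active_sets`, `warmup`, and the working definition of "permanently skipped" (never updated after the first `warmup` iterations) are assumptions, not definitions taken from the manuscript.

```python
def skip_summary(active_sets, n_units, warmup=1):
    """Summarise which units an adaptive scheme skipped.

    active_sets[i] is the set of unit indices updated at iteration i.
    A unit counts as "permanently skipped" here if it is never updated
    after the first `warmup` iterations (a working definition).
    """
    skipped_per_iter = [1.0 - len(s) / n_units for s in active_sets]
    updated_later = set().union(*active_sets[warmup:])
    permanently_skipped = sorted(set(range(n_units)) - updated_later)
    return skipped_per_iter, permanently_skipped


# Made-up log for 5 units over 4 iterations; iteration 0 is a full sweep.
active_sets = [{0, 1, 2, 3, 4}, {0, 1, 3}, {0, 1}, {0, 1, 3}]
fractions, never_again = skip_summary(active_sets, n_units=5)
print(fractions, never_again)
```

Here units 2 and 4 are never revisited after the initial sweep. Reporting the analogous list on real data would show whether any unit with a non-negligible effect falls into that group.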

Circularity Check

0 steps flagged

Empirical runtime gains rest on separate validation datasets with no derivation reducing to fitted inputs

full rationale

The paper introduces an adaptive focus (AF) block-coordinate variational inference scheme for a hierarchically linked multi-trait regression model and reports up to 50% runtime reduction on simulated and UK Biobank pQTL data while preserving statistical performance. No equations or coordinate-ascent updates are shown to be equivalent by construction to the selection thresholds or to any fitted quantity; the central claims are supported by direct comparison against full CAVI on held-out data rather than by self-referential definitions or load-bearing self-citations. The method therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method implicitly assumes the hierarchical regression structure and the validity of the relevance heuristic.

pith-pipeline@v0.9.0 · 5793 in / 1037 out tokens · 32177 ms · 2026-05-18T17:41:09.293542+00:00 · methodology



Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages

  1. [1]

    UK Biobank: From Concept to Reality

    William Ollier, Tim Sprosen, et al. “UK Biobank: From Concept to Reality”. In: Pharmacogenomics 6.6 (Sept. 2005), pp. 639–646. DOI: 10.2217/14622416.6.6.639

  2. [2]

    Global Biobank Meta-analysis Initiative: Powering genetic discovery across human disease

    Wei Zhou, Masahiro Kanai, Kuan-Han H. Wu, et al. “Global Biobank Meta-analysis Initiative: Powering genetic discovery across human disease”. In: Cell Genomics 2.10 (Oct. 2022). DOI: 10.1016/j.xgen.2022.100192

  3. [3]

    The UK Biobank resource with deep phenotyping and genomic data

    Clare Bycroft, Colin Freeman, Desislava Petkova, et al. “The UK Biobank resource with deep phenotyping and genomic data”. In: Nature 562.7726 (Oct. 2018), pp. 203–209. DOI: 10.1038/s41586-018-0579-z

  4. [4]

    Plasma proteomic associations with genetics and health in the UK Biobank

    Benjamin B. Sun, Joshua Chiou, Matthew Traylor, et al. “Plasma proteomic associations with genetics and health in the UK Biobank”. In: Nature 622.7982 (Oct. 2023), pp. 329–338. DOI: 10.1038/s41586-023-06592-6

  5. [5]

    Genetic associations with ratios between protein levels detect new pQTLs and reveal protein-protein interactions

    Karsten Suhre. “Genetic associations with ratios between protein levels detect new pQTLs and reveal protein-protein interactions”. In: Cell Genomics 4.3 (Mar. 2024). DOI: 10.1016/j.xgen.2024.100506

  6. [6]

    Mihir G. Sukhatme, Asha Kar, Uma Thanigai Arasu, et al. Integration of single cell omics with biobank data discovers trans effects of SREBF1 abdominal obesity risk variants on adipocyte expression of more than 100 genes. Nov. 2024. DOI: 10.1101/2024.11.22.24317804

  7. [7]

    Bayesian hierarchical modeling for signaling pathway inference from single cell interventional data

    Ruiyan Luo and Hongyu Zhao. “Bayesian hierarchical modeling for signaling pathway inference from single cell interventional data”. In: The Annals of Applied Statistics 5.2A (2011), pp. 725–745. DOI: 10.1214/10-AOAS425

  8. [8]

    Bayesian Quantitative Trait Loci Mapping for Multiple Traits

    Samprit Banerjee, Brian S. Yandell, and Nengjun Yi. “Bayesian Quantitative Trait Loci Mapping for Multiple Traits”. In: Genetics 179.4 (Aug. 2008), p. 2275. DOI: 10.1534/genetics.108.088427

  9. [9]

    A Statistical Framework for Joint eQTL Analysis in Multiple Tissues

    Timothée Flutre, Xiaoquan Wen, Jonathan Pritchard, et al. “A Statistical Framework for Joint eQTL Analysis in Multiple Tissues”. In: PLOS Genetics 9.5 (May 2013), e1003486. DOI: 10.1371/journal.pgen.1003486

  10. [10]

    HBI: a hierarchical Bayesian interaction model to estimate cell-type-specific methylation quantitative trait loci incorporating priors from cell-sorted bisulfite sequencing data

    Youshu Cheng, Biao Cai, Hongyu Li, et al. “HBI: a hierarchical Bayesian interaction model to estimate cell-type-specific methylation quantitative trait loci incorporating priors from cell-sorted bisulfite sequencing data”. In: Genome Biology 25.1 (Oct. 2024), p. 273. DOI: 10.1186/s13059-024-03411-7

  11. [11]

    Variational Inference: A Review for Statisticians

    David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. “Variational Inference: A Review for Statisticians”. In: Journal of the American Statistical Association 112.518 (Apr. 2017), pp. 859–877. DOI: 10.1080/01621459.2017.1285773

  12. [12]

    Spike and slab variable selection: Frequentist and Bayesian strategies

    Hemant Ishwaran and J. Sunil Rao. “Spike and slab variable selection: Frequentist and Bayesian strategies”. In: The Annals of Statistics 33.2 (Apr. 2005), pp. 730–773. DOI: 10.1214/009053604000001147

  13. [13]

    The horseshoe estimator for sparse signals

    Carlos M. Carvalho, Nicholas G. Polson, and James G. Scott. “The horseshoe estimator for sparse signals”. In: Biometrika 97.2 (2010), pp. 465–480

  14. [14]

    Homogenous 96-Plex PEA Immunoassay Exhibiting High Sensitivity, Specificity, and Excellent Scalability

    Erika Assarsson, Martin Lundberg, Göran Holmquist, et al. “Homogenous 96-Plex PEA Immunoassay Exhibiting High Sensitivity, Specificity, and Excellent Scalability”. In: PLoS ONE 9.4 (Apr. 2014). Ed. by Jörg D. Hoheisel, e95192. DOI: 10.1371/journal.pone.0095192

  15. [15]

    Genetic regulation of the human plasma proteome in 54,306 UK Biobank participants

    Benjamin B. Sun, Joshua Chiou, Matthew Traylor, et al. Genetic regulation of the human plasma proteome in 54,306 UK Biobank participants. June 2022. DOI: 10.1101/2022.06.17.496443

  16. [17]

    Efficient inference for genetic association studies with multiple outcomes

    Helene Ruffieux, Anthony C. Davison, Jorg Hager, et al. “Efficient inference for genetic association studies with multiple outcomes”. In: Biostatistics 18.4 (Oct. 2017), pp. 618–636. DOI: 10.1093/biostatistics/kxx007

  17. [18]

    An Integrated Hierarchical Bayesian Model for Multivariate eQTL Mapping

    Marie Pier Scott-Boyer, Gregory C. Imholte, Arafat Tayeb, et al. “An Integrated Hierarchical Bayesian Model for Multivariate eQTL Mapping”. In: Statistical Applications in Genetics and Molecular Biology 11.4 (Jan. 2012). DOI: 10.1515/1544-6115.1760

  18. [19]

    A multi-trait Bayesian method for mapping QTL and genomic prediction

    Kathryn E. Kemper, Philip J. Bowman, Benjamin J. Hayes, et al. “A multi-trait Bayesian method for mapping QTL and genomic prediction”. In: Genetics Selection Evolution 50.1 (Mar. 2018), p. 10. DOI: 10.1186/s12711-018-0377-y

  19. [20]

    A Systematic Heritability Analysis of the Human Whole Blood Transcriptome

    Tianxiao Huan, Chunyu Liu, Roby Joehanes, et al. “A Systematic Heritability Analysis of the Human Whole Blood Transcriptome”. In: Human Genetics 134.3 (Mar. 2015), pp. 343–358. DOI: 10.1007/s00439-014-1524-3

  20. [21]

    Approximately independent linkage disequilibrium blocks in human populations

    Tomaz Berisa and Joseph K. Pickrell. “Approximately independent linkage disequilibrium blocks in human populations”. In: Bioinformatics 32.2 (Jan. 2016), pp. 283–285. DOI: 10.1093/bioinformatics/btv546

  21. [22]

    The Median Probability Model and Correlated Variables

    Maria M. Barbieri, James O. Berger, Edward I. George, et al. “The Median Probability Model and Correlated Variables”. In: Bayesian Analysis 16.4 (Dec. 2021). DOI: 10.1214/20-BA1249

  22. [23]

    Stochastic Variational Inference

    Matthew D. Hoffman, David M. Blei, Chong Wang, et al. “Stochastic Variational Inference”. In: Journal of Machine Learning Research 14 (2013), pp. 1303–1347

  23. [24]

    Bayesian Models for Sparse Regression Analysis of High Dimensional Data

    Sylvia Richardson, Leonardo Bottolo, and Jeffrey S. Rosenthal. “Bayesian Models for Sparse Regression Analysis of High Dimensional Data”. In: Bayesian Statistics 9. Ed. by José M. Bernardo, M. J. Bayarri, James O. Berger, et al. Oxford University Press, Oct. 2011, p. 0. DOI: 10.1093/acprof:oso/9780199694587.003.0018

  24. [25]

    Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems

    Yu. Nesterov. “Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems”. In: SIAM Journal on Optimization 22.2 (Jan. 2012), pp. 341–362. DOI: 10.1137/100802001

  25. [26]

    A global-local approach for detecting hotspots in multiple-response regression

    Hélène Ruffieux, Anthony C. Davison, Jörg Hager, et al. “A global-local approach for detecting hotspots in multiple-response regression”. In: The Annals of Applied Statistics 14.2 (June 2020). DOI: 10.1214/20-AOAS1332

  26. [27]

    Hélène Ruffieux. ECHOSEQ R-package (https://github.com/hruffieux/echoseq)

  27. [28]

    A fully joint Bayesian quantitative trait locus mapping of human protein abundance in plasma

    Hélène Ruffieux, Jérôme Carayol, Radu Popescu, et al. “A fully joint Bayesian quantitative trait locus mapping of human protein abundance in plasma”. In: PLOS Computational Biology 16.6 (June 2020), e1007882. DOI: 10.1371/journal.pcbi.1007882

  28. [29]

    Robust relationship inference in genome-wide association studies

    Ani Manichaikul, Josyf C. Mychaleckyj, Stephen S. Rich, et al. “Robust relationship inference in genome-wide association studies”. In: Bioinformatics 26.22 (Nov. 2010), pp. 2867–2873. DOI: 10.1093/bioinformatics/btq559

    Appendices. A: Details of the atlasQTL model. Here we provide details of the atlasQTL model by Ruffieux, Davison, Hager, et al. [26]. Given p...

  29. [30]

    Define the dependence structure $\gamma_{st} = \mathbb{1}\{X_s \text{ is associated with } y_t\}$ for each pair of $X_s$, $y_t$

  30. [31]

    Simulate the error terms $\varepsilon_t$ with a specified correlation structure in the responses

  31. [32]

    active”, i.e., associated with at least one response, while the other SNPs are set as “inactive

    Simulate the effect sizes $\beta_{st}$. The rest of this section explains the details of each step utilizing functions in the echoseq package [27] and parameters selected according to Ruffieux, Carayol, Popescu, et al. [28]. No missing value is inserted in the simulated responses for simplicity. Step 1: Defining the dependence structure. Randomly select a p percentag...
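
The three simulation steps listed above (dependence structure, correlated errors, effect sizes) can be illustrated with a small Python sketch. It does not use or reproduce the echoseq package, and every name and default below is a hypothetical choice: genotypes are drawn as independent minor-allele counts, errors are made equicorrelated through a single shared factor, and effect sizes are Gaussian. The source text is truncated, so the paper's actual settings may differ.

```python
import math
import random


def simulate(n=100, n_snps=20, n_traits=10, prop_active=0.2,
             prop_traits=0.3, maf=0.3, rho=0.5, effect_sd=0.5, seed=1):
    """Toy multi-trait simulation; all defaults are illustrative assumptions."""
    rng = random.Random(seed)
    # Genotypes: minor-allele counts in {0, 1, 2}, independent across SNPs.
    X = [[(rng.random() < maf) + (rng.random() < maf) for _ in range(n_snps)]
         for _ in range(n)]

    # Step 1: dependence structure, gamma[s][t] = 1 if SNP s affects trait t.
    n_active = max(1, round(prop_active * n_snps))
    active = sorted(rng.sample(range(n_snps), n_active))
    n_hit = max(1, round(prop_traits * n_traits))
    gamma = [[0] * n_traits for _ in range(n_snps)]
    for s in active:
        for t in rng.sample(range(n_traits), n_hit):
            gamma[s][t] = 1

    # Step 2: errors with pairwise correlation rho via one shared factor.
    E = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)
        E.append([math.sqrt(rho) * shared
                  + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
                  for _ in range(n_traits)])

    # Step 3: effect sizes, non-zero only where gamma = 1.
    beta = [[rng.gauss(0.0, effect_sd) if gamma[s][t] else 0.0
             for t in range(n_traits)] for s in range(n_snps)]

    # Responses: genetic signal from the active SNPs plus correlated noise.
    Y = [[sum(X[i][s] * beta[s][t] for s in active) + E[i][t]
          for t in range(n_traits)] for i in range(n)]
    return X, Y, gamma, beta, active


X, Y, gamma, beta, active = simulate()
print(len(active), "active SNPs out of", len(gamma))
```

Each active SNP is associated with at least one response, while inactive SNPs have all-zero rows in both `gamma` and `beta`, mirroring the active/inactive distinction in the text.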