Privacy-preserving Meta-analysis through Low-Rank Basis Hunting

Kosuke Imai; Wenqi Shi; Yi Zhang

arxiv: 2604.23847 · v1 · submitted 2026-04-26 · 📊 stat.ME

Pith reviewed 2026-05-08 05:44 UTC · model grok-4.3

classification 📊 stat.ME

keywords meta-analysis · low-rank structure · convex hull · basis recovery · privacy-preserving · functional data · conformal prediction · prediction intervals

The pith

Meta-analysis predicts functions for new populations from study summaries alone by recovering shared low-rank bases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MetaHunt to handle meta-analysis of function-valued outcomes like regression or treatment effect curves when only study-level summaries and covariates are available. It posits that each study's underlying function belongs to the convex hull of a small collection of shared latent basis functions. These bases are recovered consistently from the observed estimates through an adapted successive projection algorithm with denoising. Study covariates are then linked to the mixing weights via semiparametric or nonparametric models to enable prediction for a target population. The procedure requires no individual-level data, accommodates arbitrary per-study estimators, and supplies conformal prediction intervals with asymptotic marginal coverage under exchangeability.

Core claim

Under the modeling assumption that every study-specific function is a convex combination of a fixed low-dimensional set of latent basis functions, the bases themselves can be recovered consistently from noisy study-level estimates by extending the successive projection algorithm to the functional setting and adding a denoising step. Once the bases are in hand, the combination weights for each study are modeled flexibly against the observed study covariates, which in turn permits direct prediction of the target function for a new population whose covariates are known, all while using only aggregate information and preserving the privacy of individual records.
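To make the basis-recovery step concrete, here is a minimal sketch of the plain successive projection algorithm applied to curves sampled on a common uniform grid. It is an illustration under assumptions, not the paper's procedure: the denoising step is omitted, the inner product is a Riemann-sum stand-in for the functional one, and the names `inner`, `spa`, `dx`, and the toy curves are all hypothetical.

```python
import math

def inner(f, g, dx=1.0):
    """Riemann-sum approximation of the L2 inner product on a uniform grid."""
    return dx * sum(a * b for a, b in zip(f, g))

def spa(curves, K, dx=1.0):
    """Plain SPA: return indices of up to K curves chosen as candidate vertices.

    Assumption: no denoising is applied, so this is only reliable when the
    study-level estimates are (nearly) noise-free.
    """
    residuals = [list(c) for c in curves]
    selected = []
    for _ in range(K):
        sq_norms = [inner(r, r, dx) for r in residuals]
        i = max(range(len(residuals)), key=sq_norms.__getitem__)
        if sq_norms[i] <= 1e-12:  # nothing left outside the current span
            break
        selected.append(i)
        scale = math.sqrt(sq_norms[i])
        u = [v / scale for v in residuals[i]]  # unit-norm direction
        # Project every residual onto the orthogonal complement of u.
        new_residuals = []
        for r in residuals:
            coef = inner(r, u, dx)
            new_residuals.append([rv - coef * uv for rv, uv in zip(r, u)])
        residuals = new_residuals
    return selected

# Hypothetical toy data: three bases on a 4-point grid and six study curves,
# the first three of which are pure (one-hot mixing weights).
g = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0]]

def mix(w):
    return [sum(wk * gk[t] for wk, gk in zip(w, g)) for t in range(4)]

mixing = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0.5, 0.5, 0), (1/3, 1/3, 1/3), (0.2, 0, 0.8)]
curves = [mix(w) for w in mixing]
picked = spa(curves, K=3)
```

In this noise-free toy the three pure curves (rows 0 to 2) are the ones selected. With noisy estimates, plain SPA can pick extreme but noisy curves instead, which is the kind of failure a denoising step is meant to guard against.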

What carries the argument

The shared low-rank structure in which each study's true function lies in the convex hull of a small set of latent basis functions, recovered via a denoised functional extension of the successive projection algorithm.

If this is right

Prediction of regression or treatment-effect functions becomes possible for a target population using only its covariate profile and the existing studies' aggregate estimates.
Each original study can employ its own machine-learning estimator without forcing a common functional form across studies.
Privacy is maintained because no individual participant records need to leave their originating sites.
Conformal prediction intervals achieve asymptotically valid marginal coverage under exchangeability plus mild estimation-error bounds.
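The coverage claim in the last item can be illustrated with standard split conformal prediction for a scalar summary of the target function. This is a generic sketch, not the paper's construction: the score (taken here to be the absolute error of a held-out study's prediction), the function names, and the numbers are assumptions made for illustration.

```python
import math

def conformal_radius(scores, alpha=0.05):
    """Split-conformal radius: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:  # too few calibration studies for this level
        return math.inf
    return sorted(scores)[k - 1]

def prediction_interval(point_prediction, scores, alpha=0.05):
    """Symmetric interval around a point prediction using the conformal radius."""
    r = conformal_radius(scores, alpha)
    return (point_prediction - r, point_prediction + r)

# Hypothetical calibration scores from 20 held-out studies.
scores = [float(s) for s in range(1, 21)]
lo, hi = prediction_interval(50.0, scores, alpha=0.1)
```

When the number of calibration studies is small relative to the requested level, the radius is infinite, which is a reminder that the number of studies, not the number of participants, is the effective sample size here.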

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same low-rank convex-hull device could be applied to other functional meta-analysis settings such as survival curves or density estimation.
If the number of latent bases is allowed to grow slowly with the number of studies, the method might accommodate more heterogeneous populations than currently assumed.
In domains where data sharing is restricted by regulation, the aggregate-only workflow could enable collaborative modeling that would otherwise be infeasible.

Load-bearing premise

The true functions from the different studies are all convex combinations of the same small collection of latent basis functions.
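Written out, the premise might be stated as follows. The notation is an assumption chosen for illustration ($f^{(i)}$ for the true function of study $i$, $g_k$ for the latent bases, $\pi_{ik}$ for the mixing weights, $\mathcal{H}$ for the function space, $n$ for the number of studies); the paper's own symbols may differ.

```latex
f^{(i)} \;=\; \sum_{k=1}^{K} \pi_{ik}\, g_k ,
\qquad \pi_{ik} \ge 0 ,
\qquad \sum_{k=1}^{K} \pi_{ik} = 1 ,
\qquad i = 1, \dots, n ,
\qquad g_1, \dots, g_K \in \mathcal{H} .
```

The analyst never sees $f^{(i)}$ itself, only a study-level estimate $\hat f^{(i)} \approx f^{(i)}$, and $K$ is small relative to $n$.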

What would settle it

Collect many studies, apply the basis-hunting procedure, and check whether the recovered bases plus estimated weights can approximate the held-out study functions with error that shrinks at the expected rate as study sample sizes grow; failure of this approximation on real or simulated data with known convex-hull structure would falsify the claim.
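One way to run such a check on simulated data is to fit simplex-constrained weights for a held-out curve against the recovered bases and record the residual error. The sketch below uses projected gradient descent with a sort-based simplex projection. All names, the step-size rule, and the toy bases are assumptions for illustration, and curves are taken to be sampled on a common grid.

```python
def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1} (sort-based)."""
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        if uj - t > 0:
            theta = t  # keep the threshold from the last index that qualifies
    return [max(x - theta, 0.0) for x in v]

def fit_weights(target, bases, iters=500):
    """Minimise ||target - sum_k w_k * bases[k]||^2 over the simplex."""
    K = len(bases)
    # Conservative step: 1 / (2 * trace of Gram matrix) <= 1 / Lipschitz constant.
    step = 1.0 / (2.0 * sum(sum(b * b for b in g) for g in bases))
    w = [1.0 / K] * K
    for _ in range(iters):
        fitted = [sum(w[k] * bases[k][t] for k in range(K)) for t in range(len(target))]
        resid = [y - f for y, f in zip(target, fitted)]
        grad = [-2.0 * sum(r * b for r, b in zip(resid, bases[k])) for k in range(K)]
        w = project_simplex([wk - step * gk for wk, gk in zip(w, grad)])
    return w

def sq_error(target, bases, w):
    """Squared reconstruction error of the weighted combination."""
    fitted = [sum(wk * g[t] for wk, g in zip(w, bases)) for t in range(len(target))]
    return sum((y - f) ** 2 for y, f in zip(target, fitted))

# Hypothetical recovered bases and a held-out curve that lies inside their hull.
bases = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
held_out = [0.2, 0.6, 1.5]  # equals 0.2*g1 + 0.3*g2 + 0.5*g3
w_hat = fit_weights(held_out, bases)
err = sq_error(held_out, bases, w_hat)
```

A held-out curve far from the hull of the recovered bases leaves `err` bounded away from zero no matter how the weights are chosen, which is the signal this check looks for.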

Figures

Figures reproduced from arXiv: 2604.23847 by Kosuke Imai, Wenqi Shi, Yi Zhang.

**Figure 1.** (Left panel) Reconstruction error decreases with ...

**Figure 2.** Mean squared error (MSE) of d-fSPA as a function of the number of studies ...

**Figure 3.** Average marginal coverage (left) and 95% prediction interval length (right) versus ...

**Figure 4.** Prediction intervals, predicted values, and expected values versus ...

**Figure 5.** Scatterplots of predicted target-site mean outcomes versus the corresponding empir...

**Figure 6.** Scatterplots of predicted target-site ATE versus the corresponding empirical bench...

**Figure 7.** Hypothesis-specific performance for the causal inference task under leave-one-site ...

Original abstract

A central challenge of meta-analysis is that the populations underlying existing studies often differ from the target population in unknown ways. We study the problem of predicting function-valued quantities, such as regression and conditional average treatment effect functions, for a new target population using only study-level covariates and estimates. We propose MetaHunt, a new meta-analysis methodology based on a shared low-rank structure, in which the true function from each study lies within the convex hull of a small set of latent basis functions. To recover these basis functions, we extend the Successive Projection Algorithm to the functional setting, incorporating a denoised basis-hunting step. We establish consistency of the recovered basis functions under mild regularity conditions. We then model the relationship between study-level covariates and the corresponding mixing weights using flexible semi-parametric or non-parametric methods. MetaHunt is privacy-preserving and enables meta-analytic prediction based on study-level information alone, even when individual-level data are unavailable to analysts. In addition, for each study, functions of interest can be estimated using possibly different machine learning algorithms. For uncertainty quantification, we construct prediction intervals via conformal prediction. We show that, under exchangeability and mild estimation-error conditions, these intervals achieve asymptotically valid marginal coverage. We demonstrate the effectiveness of MetaHunt through both simulation studies and empirical applications.
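As one concrete example of the nonparametric option mentioned in the abstract, mixing weights for a target population could be predicted by a kernel-weighted average of the studies' fitted weights. A Nadaraya-Watson smoother is used here only because averaging simplex vectors keeps the result on the simplex. The abstract does not say this is the estimator the authors use, and the function names, Gaussian kernel, bandwidth, and data are all hypothetical.

```python
import math

def predict_weights(x_new, covariates, weights, bandwidth=1.0):
    """Nadaraya-Watson estimate of mixing weights at covariate vector x_new."""
    kern = [
        math.exp(-sum((a - b) ** 2 for a, b in zip(x_new, x)) / (2.0 * bandwidth ** 2))
        for x in covariates
    ]
    total = sum(kern)
    if total == 0.0:
        raise ValueError("x_new is too far from all studies for this bandwidth")
    K = len(weights[0])
    # A convex combination of simplex vectors is again a simplex vector.
    return [sum(k * w[j] for k, w in zip(kern, weights)) / total for j in range(K)]

def predict_function(pi, bases):
    """Target function as the pi-weighted combination of basis curves."""
    return [sum(p * g[t] for p, g in zip(pi, bases)) for t in range(len(bases[0]))]

# Hypothetical inputs: three studies, one covariate, two bases on a 3-point grid.
covariates = [[0.0], [1.0], [2.0]]
weights = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
bases = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
pi_new = predict_weights([1.0], covariates, weights, bandwidth=0.5)
f_new = predict_function(pi_new, bases)
```

Because the target covariate sits symmetrically between the two outer studies, the predicted weights are close to one half each and the predicted curve is close to flat.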

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MetaHunt gives a privacy-preserving route to meta-analyzing functions by recovering shared low-rank bases from study summaries alone, with consistency and conformal guarantees under its modeling assumptions.

MetaHunt offers a privacy-preserving method for meta-analyzing function-valued quantities like treatment effect curves using only study-level summaries. It works by positing that each study's function is a convex combination of a handful of shared latent bases, recovers those bases via an adapted Successive Projection Algorithm with denoising, and then links the mixing weights to study covariates through semi- or non-parametric models. The new piece is the functional extension of the basis-hunting step plus the explicit privacy workflow that avoids individual data. It also lets each study use its own machine learning estimator for the functions.

The paper claims consistency for the recovered bases under mild conditions and shows that conformal prediction gives valid coverage for the target predictions under exchangeability. This approach is useful because many meta-analyses face privacy barriers or heterogeneous data sources. The simulations and empirical examples are meant to illustrate practical performance.

The main soft spot is the modeling assumption itself. If real functions do not lie close to the convex hull of a small basis set, the recovery step could introduce bias that the later steps do not fix. The paper presents this as the key structure rather than deriving it, so users need to assess whether it fits their setting. The number of bases is a tuning parameter that requires choice, and the full details on how the denoising is implemented would need checking in the proofs.

Readers who do meta-analysis in regulated fields or who work with functional data will get the most from it. It is not a general-purpose tool but a targeted one for when the low-rank structure is plausible. I would bring this to a reading group for discussion on the assumption and the conformal part. It deserves peer review because the problem is important and the technical steps appear coherent.

Recommendation: send it out for review; the contribution is specific enough to warrant expert feedback on the theory and the applications.

Referee Report

2 major / 2 minor

Summary. The paper introduces MetaHunt, a privacy-preserving meta-analysis method for predicting function-valued quantities (e.g., regression or CATE functions) in a target population using only study-level covariates and estimates. It assumes each study's true function lies in the convex hull of a small set of latent basis functions, recovers these bases via an extension of the Successive Projection Algorithm incorporating a denoised step, establishes consistency under mild regularity conditions, models the mixing weights with flexible semi- or non-parametric methods, and constructs conformal prediction intervals achieving asymptotic marginal coverage under exchangeability and mild estimation-error conditions. The approach permits heterogeneous machine-learning estimators across studies and is demonstrated via simulations and empirical applications.

Significance. If the consistency and coverage results hold, the work offers a meaningful contribution to meta-analysis by enabling transport of function-valued inferences to new populations while requiring only aggregate data, which is valuable in privacy-constrained settings. The low-rank convex-hull structure provides a structured handle on heterogeneity, the allowance for study-specific estimators adds flexibility, and the combination of basis recovery with conformal inference supplies both point estimates and valid uncertainty quantification. The theoretical claims are presented as conditional on the modeling assumptions rather than unconditional, which is appropriately scoped.

major comments (2)

[Basis recovery and consistency] Basis recovery section: the consistency of the recovered basis functions is established via the extended Successive Projection Algorithm with a denoised step under mild regularity conditions; the manuscript should explicitly state how the denoising modification alters the original SPA guarantees and whether additional conditions on the functional space or noise level are required beyond those stated in the abstract.
[Conformal prediction] Conformal prediction section: the asymptotic marginal coverage is claimed under exchangeability and mild estimation-error conditions; the manuscript should clarify whether the approximation error from the low-rank basis recovery is absorbed into the estimation-error term or requires a separate rate condition to ensure the coverage guarantee remains valid.

minor comments (2)

[Implementation and tuning] The selection procedure for the number of latent basis functions is mentioned as a free parameter; the manuscript would benefit from guidance or a data-driven rule for choosing this number in the simulation and application sections.
[Notation] Notation for estimated versus population quantities (e.g., basis functions and mixing weights) should be made uniform across the theoretical and empirical sections to improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful review and recommendation for minor revision. The comments provide valuable guidance on enhancing the clarity of our theoretical contributions regarding basis recovery and conformal prediction. We respond to each major comment below.

Referee: [Basis recovery and consistency] Basis recovery section: the consistency of the recovered basis functions is established via the extended Successive Projection Algorithm with a denoised step under mild regularity conditions; the manuscript should explicitly state how the denoising modification alters the original SPA guarantees and whether additional conditions on the functional space or noise level are required beyond those stated in the abstract.

Authors: We agree that the manuscript would benefit from an explicit statement on the impact of the denoising modification. Our extension incorporates a denoised basis-hunting step that reduces the effect of noise in the observed functions before applying the successive projections. This modification does not alter the fundamental consistency guarantees of the original SPA; it maintains them under the same mild regularity conditions, provided the denoising is consistent, which holds without additional conditions on the functional space or noise level beyond those in the abstract. In the revised manuscript, we will add a clarifying paragraph in the basis recovery section to detail this relationship.
revision: yes
Referee: [Conformal prediction] Conformal prediction section: the asymptotic marginal coverage is claimed under exchangeability and mild estimation-error conditions; the manuscript should clarify whether the approximation error from the low-rank basis recovery is absorbed into the estimation-error term or requires a separate rate condition to ensure the coverage guarantee remains valid.

Authors: We appreciate this request for clarification. The approximation error arising from the low-rank basis recovery is absorbed into the mild estimation-error conditions. Under the established consistency of the basis functions, this error term vanishes at a sufficient rate to be encompassed within the conditions that ensure the asymptotic marginal coverage of the conformal prediction intervals, without necessitating a separate rate condition. We will revise the conformal prediction section to explicitly note this absorption and its implications for the coverage guarantee.
revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper explicitly posits the shared low-rank convex-hull structure as the modeling assumption enabling MetaHunt, then derives consistency of the recovered basis functions by extending the Successive Projection Algorithm with a denoised step under mild regularity conditions. The subsequent semi-parametric modeling of mixing weights from study-level covariates and the conformal prediction intervals are applied separately to the recovered bases and do not reduce any target quantity to a fitted input by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps; the internal logic remains self-contained against the stated assumptions and external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central modeling assumption is the low-rank convex-hull structure; the size of the basis set and any tuning parameters for the basis-hunting algorithm are free parameters. No new physical entities are postulated.

free parameters (1)

number of latent basis functions
The rank of the shared basis set is a modeling choice that must be selected or tuned; it directly determines the dimension of the convex hull representation.

axioms (2)

domain assumption: The true function from each study lies within the convex hull of a small set of latent basis functions.
This is the core structural assumption stated in the abstract that enables recovery of the bases from study-level estimates.
domain assumption: Mild regularity conditions suffice for consistency of the recovered basis functions.
Invoked to guarantee that the extended SPA recovers the true bases.

