FL-Sailer: Efficient and Privacy-Preserving Federated Learning for Scalable Single-Cell Epigenetic Data Analysis via Adaptive Sampling

Guangyi Zhang; Junhao Liu; Yi Dai; Yiyun He

arXiv: 2605.04519 · v1 · submitted 2026-05-06 · cs.LG · stat.ML

Pith reviewed 2026-05-08 16:33 UTC · model grok-4.3

keywords: federated learning, single-cell ATAC-seq, adaptive sampling, invariant VAE, privacy-preserving analysis, chromatin accessibility, epigenomics, dimensionality reduction

The pith

A tailored federated learning system for single-cell chromatin data enables private multi-institution analysis and outperforms centralized training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that federated learning can be adapted to the extreme dimensionality, sparsity, and cross-site heterogeneity of scATAC-seq data so that institutions can collaborate without exchanging raw sequencing reads. It does so by pairing adaptive leverage-score sampling, which trims the feature space by eighty percent while retaining biologically readable peaks, with an invariant variational autoencoder that minimizes mutual information to separate biological variation from technical batch effects. A convergence result shows the distributed solution stays within bounded distance of the full centralized optimum. If these pieces work, previously blocked joint studies become routine and the sampling step itself functions as an implicit regularizer that reduces noise better than pooling all data in one place.

Core claim

FL-Sailer combines two components: adaptive leverage score sampling, which selects interpretable features and cuts dimensionality by 80 percent, and an invariant VAE, which disentangles biological signals from technical confounders via mutual information minimization. The paper claims a convergence guarantee to an approximate solution of the original high-dimensional problem with bounded error, and empirical performance that exceeds centralized baselines on synthetic and real epigenomic datasets.

What carries the argument

Adaptive leverage score sampling for feature selection and dimensionality reduction, paired with an invariant variational autoencoder using mutual information minimization to isolate biological signals.
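As a rough illustration of the sampling idea, the sketch below computes rank-k column leverage scores for a cells-by-peaks matrix with NumPy and keeps 20% of the peaks by sampling in proportion to those scores. The function names, the chosen rank, the without-replacement draw, and the toy binary matrix are all assumptions made for illustration. This page does not spell out the paper's adaptive or federated variant, so this should not be read as the authors' algorithm.

```python
import numpy as np

def column_leverage_scores(A, rank):
    """Rank-`rank` leverage score of every column (peak) of A."""
    # Thin SVD: A = U @ diag(S) @ Vt; each column of Vt corresponds to one peak.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    V_r = Vt[:rank].T                    # shape (n_peaks, rank)
    return np.sum(V_r ** 2, axis=1)      # non-negative, sums to `rank`

def sample_peaks(A, rank, keep_fraction=0.2, seed=0):
    """Keep a fraction of peaks, drawn with probability proportional to leverage."""
    rng = np.random.default_rng(seed)
    n_peaks = A.shape[1]
    n_keep = max(1, int(round(keep_fraction * n_peaks)))
    scores = column_leverage_scores(A, rank)
    probs = scores / scores.sum()
    kept = rng.choice(n_peaks, size=n_keep, replace=False, p=probs)
    return np.sort(kept), scores

# Toy stand-in for a sparse binary cells-by-peaks matrix (hypothetical data).
toy = (np.random.default_rng(1).random((200, 1000)) < 0.05).astype(float)
kept, scores = sample_peaks(toy, rank=10)
reduced = toy[:, kept]                   # 80% fewer columns
print(reduced.shape, round(float(scores.sum()), 6))
```

Because the kept columns are original peaks rather than linear combinations of them, each retained feature still maps to a genomic region, which is the sense in which this kind of reduction stays interpretable.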

If this is right

Multi-institutional scATAC-seq studies become feasible without violating privacy rules on data sharing.
The federated model produces lower technical noise than a centralized model trained on pooled data.
Feature count drops by 80 percent while the retained peaks remain biologically interpretable.
The algorithm converges to a solution whose error relative to the full high-dimensional problem is provably bounded.
The same architecture can be applied to other sparse, high-dimensional epigenomic modalities facing similar institutional barriers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The regularizing effect of adaptive sampling may prove useful in federated settings beyond epigenomics where heterogeneity is high.
If the invariant VAE successfully isolates institution-independent signals, the learned representations could serve as inputs for cross-site meta-analyses of regulatory elements.
Extending the sampling strategy to dynamic feature selection during training rounds could further reduce communication costs in very large cohorts.

Load-bearing premise

The adaptive sampling step continues to pick biologically meaningful peaks even when institutions differ in sequencing depth and cell-type composition, and the mutual-information term truly removes technical confounders without injecting new artifacts.
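One way to probe the second half of this premise is to measure how much site information survives in the learned representation. The sketch below is a plug-in estimate of the mutual information between two discrete labelings, for example cluster assignments derived from the latent space and the institution each cell came from. The function name and the toy labels are hypothetical, and this diagnostic is an editorial suggestion rather than a procedure the paper is known to report.

```python
import numpy as np

def discrete_mutual_information(a, b):
    """Plug-in estimate of I(a; b) in nats for two equal-length 1-D label arrays."""
    a = np.asarray(a)
    b = np.asarray(b)
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    # Joint frequency table of the two labelings.
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx, b_idx), 1.0)
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0                      # empty cells contribute nothing to the sum
    return float(np.sum(joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])))

clusters = [0, 0, 1, 1]
site_independent = [0, 1, 0, 1]         # clusters carry no site information
site_confounded = [0, 0, 1, 1]          # clusters mirror the site label
mi_independent = discrete_mutual_information(clusters, site_independent)
mi_confounded = discrete_mutual_information(clusters, site_confounded)
print(mi_independent, mi_confounded)
```

A value near zero is consistent with clusters that do not track the site label, while a value near the entropy of the site label suggests residual confounding. Plug-in estimates are biased upward on small samples, so comparisons are most meaningful between methods evaluated on the same cells.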

What would settle it

On a real multi-site scATAC-seq collection, run both the federated model and a centralized model on the same downstream task of identifying cell-type-specific peaks; if the federated version recovers fewer known marker regions or shows higher false-positive rates, the central claim does not hold.
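The comparison described above reduces to two numbers per model. A minimal sketch, assuming peak calls are available as sets of identifiers and that a curated list of known marker regions exists, is shown below; the function name and all peak identifiers are hypothetical.

```python
def marker_recovery(predicted, known_markers, candidate_peaks):
    """Recall on known markers and false-positive rate over non-marker candidates."""
    predicted = set(predicted)
    known = set(known_markers)
    negatives = set(candidate_peaks) - known
    true_pos = len(predicted & known)
    false_pos = len(predicted & negatives)
    recall = true_pos / len(known) if known else float("nan")
    fpr = false_pos / len(negatives) if negatives else float("nan")
    return recall, fpr

# Hypothetical peak identifiers, for illustration only.
candidates = [f"peak_{i}" for i in range(10)]
known = ["peak_0", "peak_1", "peak_2", "peak_3"]
federated_calls = ["peak_0", "peak_1", "peak_2", "peak_7"]
centralized_calls = ["peak_0", "peak_1", "peak_8", "peak_9"]

fed_recall, fed_fpr = marker_recovery(federated_calls, known, candidates)
cen_recall, cen_fpr = marker_recovery(centralized_calls, known, candidates)
print(fed_recall, fed_fpr, cen_recall, cen_fpr)
```

In a real evaluation both models would be scored against the same marker list and the same candidate peak set, ideally over several random seeds, so that a difference in recall or false-positive rate can be separated from run-to-run variation.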

Figures

Figures reproduced from arXiv: 2605.04519 by Guangyi Zhang, Junhao Liu, Yi Dai, Yiyun He.

**Figure 1.** FL-Sailer: Federated Learning for Single-Cell Epigenomics. FL-Sailer makes federated learning feasible for million-feature genomic data by jointly addressing dimensionality (adaptive leverage score sampling: d → s, 80% reduction) and heterogeneity (invariant VAE: I(z, c) minimization), transforming a computationally impossible problem into a practical solution with superior performance. Pipeline: (1) Clien…

**Figure 2.** FL-Sailer overcomes key FL barriers on synthetic scATAC-seq dataset. (a) Performance under homogeneous conditions: FL-Sailer matches centralized accuracy while preserving privacy. (b) Robustness to confounded heterogeneity: Disentangles biological signals from technical noise. (c) Robustness to extreme class imbalance: Maintains rare cell population detection across SNRs. These results show that FL-Sailer…

**Figure 3.** Comparative analysis of FL-Sailer clustering performance on real-world scATAC-seq datasets. We evaluate four approaches on Brain PFC (127,219 features), PsychENCODE (423,443 features), and PBMC (108,344 features) datasets, visualized using UMAP projections. We compare our proposed FL-Sailer architecture with the raw data and various methods: (1) centralized training, i.e. Sailer's method (Cao et al., 2021);…

**Figure 4.** Sequence depth analysis of FL-Sailer clustering performance on synthetic (left) and real-world…

Original abstract

Single-cell ATAC-seq (scATAC-seq) enables high-resolution mapping of chromatin accessibility, yet privacy regulations and data size constraints hinder multi-institutional sharing. Federated learning (FL) offers a privacy-preserving alternative, but faces three fundamental barriers in scATAC-seq analysis: ultra-high dimensionality, extreme sparsity, and severe cross-institutional heterogeneity. We propose FL-Sailer, the first FL framework designed for scATAC-seq data. FL-Sailer integrates two key innovations: (i) adaptive leverage score sampling, which selects biologically interpretable features while reducing dimensionality by 80%, and (ii) an invariant VAE architecture, which disentangles biological signals from technical confounders via mutual information minimization. We provide a convergence guarantee, showing that FL-Sailer converges to an approximate solution of the original high-dimensional problem with bounded error. Extensive experiments on synthetic and real epigenomic datasets demonstrate that FL-Sailer not only enables previously infeasible multi-institutional collaborations but also surpasses centralized methods by leveraging adaptive sampling as an implicit regularizer to suppress technical noise. Our work establishes that federated learning, when tailored to domain-specific challenges, can become a superior paradigm for collaborative epigenomic research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FL-Sailer brings the first federated setup to scATAC-seq with leverage sampling and an invariant VAE, but its claim of beating centralized training lacks support from the convergence bound or the experiments shown.

The paper's core contribution is a federated learning pipeline tailored to scATAC-seq that combines adaptive leverage-score sampling for 80% dimensionality reduction with an invariant VAE that minimizes mutual information to separate biology from technical batch effects. It also states a convergence guarantee for the federated process to an approximate solution of the original problem. Those pieces are new in this exact combination for this data type, and the work directly addresses the practical barrier of privacy rules that block multi-site chromatin accessibility studies. The experiments on synthetic and real epigenomic sets show the method runs where full data sharing is impossible, which is useful on its own.

The sampling step is presented as an implicit regularizer that reduces noise better than full centralized training. That is the part that does not hold up. The convergence result only bounds the error between the sampled federated objective and the unsampled one; it does not demonstrate that the sampled objective has lower effective noise or better generalization than training on the complete centralized matrix. If the leverage scores are driven by site-specific sparsity patterns or batch artifacts rather than shared biological features, the reduction can discard signal or add selection bias that the VAE term may not correct. The abstract gives no derivation of the bound, no error bars on the performance claims, and no ablation that isolates the sampling effect from the rest of the pipeline. Those gaps make the superiority result hard to trust without further checks.

Readers working on privacy-preserving single-cell methods or federated learning for sparse genomics data will find the setup worth looking at for ideas on handling heterogeneity. The paper is coherent enough on its own terms to merit a serious referee, though any review should focus on whether the sampling truly improves over full data rather than just enabling the federated case.

I would send it to peer review with requests for the bound derivation, controlled ablations, and clearer dataset descriptions.

Referee Report

4 major / 0 minor

Summary. The paper proposes FL-Sailer, the first federated learning framework tailored to scATAC-seq data. It combines adaptive leverage score sampling (claimed to reduce dimensionality by 80% while selecting biologically interpretable features) with an invariant VAE that uses mutual-information minimization to disentangle biological signals from technical confounders. A convergence guarantee is stated for the federated procedure, and experiments on synthetic and real epigenomic datasets are said to show that the method enables previously infeasible multi-institutional collaborations and outperforms centralized training by treating adaptive sampling as an implicit regularizer that suppresses technical noise.

Significance. If the superiority claim and the biological invariance of the selected features hold, the work would be significant: it would provide a practical route to privacy-preserving collaborative analysis of large, distributed scATAC-seq collections that currently cannot be centralized, while potentially improving robustness over full-data centralized training.

major comments (4)

[Abstract] Abstract: the central claim that 'FL-Sailer ... surpasses centralized methods by leveraging adaptive sampling as an implicit regularizer to suppress technical noise' is not supported by the stated convergence guarantee, which only bounds approximation error to the unsampled objective and does not establish that the sampled objective has lower effective noise or better generalization than the full centralized objective.
[Abstract] Abstract: no derivation, assumptions, or explicit error bound is provided for the convergence guarantee, despite the guarantee being presented as a key contribution that justifies the 80% dimensionality reduction on ultra-high-dimensional sparse matrices.
[Abstract] Abstract: the experimental superiority is asserted without any description of the datasets, baselines (including centralized full-data training), metrics, or statistical reporting (e.g., error bars or significance tests), leaving the claim that sampling suppresses technical noise unverifiable.
[Abstract] Abstract: the assumption that local leverage scores computed on heterogeneous, site-specific sparse matrices recover features whose chromatin accessibility patterns are biologically invariant across institutions is load-bearing for the regularizer argument but receives no supporting argument or test.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for their thorough review and valuable comments on our paper. We address each of the major comments point by point below. We agree that the abstract requires clarification and will make revisions accordingly to better align the claims with the supporting evidence in the manuscript.

Referee: [Abstract] Abstract: the central claim that 'FL-Sailer ... surpasses centralized methods by leveraging adaptive sampling as an implicit regularizer to suppress technical noise' is not supported by the stated convergence guarantee, which only bounds approximation error to the unsampled objective and does not establish that the sampled objective has lower effective noise or better generalization than the full centralized objective.

Authors: We agree with the referee that the convergence guarantee does not by itself prove that the sampled objective has superior generalization or lower effective noise compared to the full centralized objective; it only provides a bound on the approximation error. The assertion of surpassing centralized methods is grounded in our experimental findings, where adaptive sampling leads to improved performance, consistent with a regularizing effect. We will revise the abstract to distinguish between the theoretical guarantee and the empirical observations.

revision: yes

Referee: [Abstract] Abstract: no derivation, assumptions, or explicit error bound is provided for the convergence guarantee, despite the guarantee being presented as a key contribution that justifies the 80% dimensionality reduction on ultra-high-dimensional sparse matrices.

Authors: The full manuscript contains the derivation of the convergence bound, including the assumptions (e.g., on the sampling probabilities and data heterogeneity) and the explicit error term. The abstract, however, does not include these details due to its brevity. We will revise the abstract to briefly reference the guarantee and its role in justifying the dimensionality reduction.

revision: yes

Referee: [Abstract] Abstract: the experimental superiority is asserted without any description of the datasets, baselines (including centralized full-data training), metrics, or statistical reporting (e.g., error bars or significance tests), leaving the claim that sampling suppresses technical noise unverifiable.

Authors: We acknowledge that the abstract lacks specific details on the experimental setup. The complete paper describes the datasets, includes comparisons to centralized full-data training, and reports metrics with error bars and significance tests. To address this, we will incorporate a more informative summary of the experiments into the revised abstract without exceeding length constraints.

revision: yes

Referee: [Abstract] Abstract: the assumption that local leverage scores computed on heterogeneous, site-specific sparse matrices recover features whose chromatin accessibility patterns are biologically invariant across institutions is load-bearing for the regularizer argument but receives no supporting argument or test.

Authors: The referee correctly identifies that this assumption underpins the regularizer interpretation. While the manuscript shows that the selected features are biologically meaningful and lead to consistent performance gains, we did not include a dedicated test or argument for cross-institutional invariance of the accessibility patterns. We will add such an analysis in the revised version to support this claim.

revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on stated guarantees and empirical results rather than definitional reduction

The abstract and described claims present a convergence guarantee that bounds approximation error to the unsampled objective and an empirical demonstration that adaptive sampling acts as a regularizer. No equations or steps are shown that reduce the superiority claim or the bound to a fitted parameter renamed as prediction, a self-citation chain, or an ansatz smuggled via prior work. The mutual-information term and leverage-score selection are introduced as architectural choices whose validity is asserted via experiments rather than derived tautologically from the target result. The derivation chain therefore remains self-contained against external benchmarks and does not collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone supplies no explicit free parameters, axioms, or invented entities; the sampling and VAE components are described at a high level without stating what quantities are fitted or assumed.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

[1]

Let $w^* = \arg\min_{w} \|Aw - y\|_2^2 = A^{\dagger} y$ be the optimal solution in the original space, and $\tilde{w}^* = \arg\min_{\tilde{w}} \|AT\tilde{w} - y\|_2^2 = (AT)^{\dagger} y$ be the optimal solution in the reduced space.
[2]

The reconstruction $w_{\mathrm{recon}} = T\tilde{w}^*$ provides an exact solution in terms of objective value. Under the assumptions of Lemma 4.1, we have $\|A w_{\mathrm{recon}} - y\|_2^2 = \|A w^* - y\|_2^2$.
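The claim that the reconstructed solution matches the original objective value can be checked numerically in the favorable case where the sampled matrix $AT$ spans the same column space as $A$. The sketch below builds a low-rank $A$, takes $T$ to be a plain column-selection matrix, and compares the two residuals. The dimensions, the random data, and the unscaled form of $T$ are assumptions for illustration; the paper's $T$ may include rescaling.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r, s = 50, 40, 5, 10

# Rank-r design matrix A (n x d), so a few columns can span its column space.
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
y = rng.standard_normal(n)

# Column-selection matrix T (d x s): A @ T keeps s of the d columns.
cols = rng.choice(d, size=s, replace=False)
T = np.zeros((d, s))
T[cols, np.arange(s)] = 1.0

# rcond is set explicitly so numerically-zero singular values are discarded.
w_star = np.linalg.pinv(A, rcond=1e-10) @ y           # w* = A^+ y
w_tilde = np.linalg.pinv(A @ T, rcond=1e-10) @ y      # reduced-space optimum
w_recon = T @ w_tilde                                 # lift back to R^d

res_full = np.linalg.norm(A @ w_star - y) ** 2
res_recon = np.linalg.norm(A @ w_recon - y) ** 2
print(res_full, res_recon)
```

Both residuals equal the squared distance from $y$ to the column space of $A$, which is why they agree here. If the selected columns failed to span that space, the reduced problem would generally leave a larger residual.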
[3]

The gradient information is preserved: for any $\tilde{w} \in \mathbb{R}^s$, if $\nabla f(w) = 2A^\top (Aw - y)$ and $\nabla \tilde{f}(\tilde{w}) = 2(AT)^\top (AT\tilde{w} - y)$, then $(1-\epsilon)\|\nabla f(T\tilde{w})\|_2^2 \le \|\nabla \tilde{f}(\tilde{w})\|_2^2 \le (1+\epsilon)\|\nabla f(T\tilde{w})\|_2^2$.

Remark B.2 (Federated Leverage Score Aggregation). The weighted aggregation in Algorithm 1 computes an approximation to the global leverage scores. For scATAC-seq data, all instit...
[4]

Since $Aw^* \in \operatorname{col}(A)$, the column space equality implies $Aw^* \in \operatorname{col}(\tilde{A})$.

(4) This subspace embedding property implies that the column spaces of the original and sampled matrices are identical: $\operatorname{col}(\tilde{A}) = \ker(\tilde{A}^\top)^\perp = \ker(A^\top)^\perp = \operatorname{col}(A)$, where the first and last equalities follow from the fundamental theorem of linear algebra, and the middle equality follows from (4), which implies $\ker(A^\top) = \ker(\tilde{A}^\top)$ (since for $\epsilon \in (0, 1)$, ∥A⊤...
[5]

We aim to prove $(1-\epsilon)\|\nabla f(T\tilde{w})\|_2^2 \le \|\nabla \tilde{f}(\tilde{w})\|_2^2 \le (1+\epsilon)\|\nabla f(T\tilde{w})\|_2^2$. Here the gradients are $\nabla f(w) = 2A^\top (Aw - y)$ and $\nabla \tilde{f}(\tilde{w}) = 2(AT)^\top (AT\tilde{w} - y)$. For the composed function at $w = T\tilde{w}$, we have $\nabla f(T\tilde{w}) = 2A^\top (AT\tilde{w} - y)$. Therefore, $\|\nabla \tilde{f}(\tilde{w})\|_2^2 = 4\|(AT)^\top (AT\tilde{w} - y)\|_2^2$ and $\|\nabla f(T\tilde{w})\|_2^2 = 4\|A^\top (AT\tilde{w} - y)\|_2^2$. Let $e := AT\tilde{w} - y \in \mathbb{R}^n$. By Lemma 4.1, for all $v \in \mathbb{R}^n$, (...
[6]

Cell Type Separability Preservation: For any two distinct cell types $C_i, C_j$, define their separation in the original space as $\Delta_{ij} = \|\mu_i - \mu_j\|_2 / \sqrt{\sigma_i^2 + \sigma_j^2}$, where $\mu_i, \mu_j$ are class centers and $\sigma_i^2, \sigma_j^2$ are within-class variances. The separation after sampling $\tilde{\Delta}_{ij}$ satisfies $\Pr\left[(1-2\epsilon)\Delta_{ij} \le \tilde{\Delta}_{ij} \le (1+2\epsilon)\Delta_{ij}\right] \ge 1 - \delta$.
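To make the separation statistic concrete, the sketch below computes it on a tiny hand-built dataset, once on all features and once on a subset of columns. The function name, the toy data, and the plain unscaled column subset are assumptions; the paper's sampling operator may rescale the kept columns, so the ratio printed here only illustrates the quantity being bounded and does not verify the stated probability bound.

```python
import numpy as np

def separation(X, labels, i, j):
    """Delta_ij = ||mu_i - mu_j||_2 / sqrt(sigma_i^2 + sigma_j^2)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    Xi, Xj = X[labels == i], X[labels == j]
    mu_i, mu_j = Xi.mean(axis=0), Xj.mean(axis=0)
    # Within-class variance: mean squared distance of a cell to its class center.
    var_i = np.mean(np.sum((Xi - mu_i) ** 2, axis=1))
    var_j = np.mean(np.sum((Xj - mu_j) ** 2, axis=1))
    return float(np.linalg.norm(mu_i - mu_j) / np.sqrt(var_i + var_j))

X = np.array([[0, 0, 1], [0, 2, 1], [10, 0, 3], [10, 2, 3]])
labels = np.array([0, 0, 1, 1])
delta_full = separation(X, labels, 0, 1)
delta_sampled = separation(X[:, [0, 1]], labels, 0, 1)   # keep two of three features
print(delta_full, delta_sampled, delta_sampled / delta_full)
```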
[7]

Clustering Structure Preservation: Define the Davies-Bouldin index as $\mathrm{DB} = \frac{1}{K}\sum_{i=1}^{K} \max_{j \ne i} \frac{\sigma_i + \sigma_j}{\|\mu_i - \mu_j\|_2}$. Then the DB index before and after sampling satisfies $|\mathrm{DB}_{\mathrm{sampled}} - \mathrm{DB}_{\mathrm{original}}| \le 2\epsilon \cdot \mathrm{DB}_{\mathrm{original}}$.
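The Davies-Bouldin index as defined in this excerpt can be computed directly from labeled data. The sketch below is a minimal NumPy version that takes each class spread to be the root-mean-square distance to the class center, matching the within-class variance used in the neighboring excerpts. That choice, the function name, and the toy data are assumptions; library implementations may define the spread differently.

```python
import numpy as np

def davies_bouldin(X, labels):
    """DB = mean over classes of the worst (spread_a + spread_b) / center distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centers = np.array([X[labels == k].mean(axis=0) for k in classes])
    # Spread of a class: root-mean-square distance of its members to the center.
    spreads = np.array([
        np.sqrt(np.mean(np.sum((X[labels == k] - c) ** 2, axis=1)))
        for k, c in zip(classes, centers)
    ])
    worst = []
    for a in range(len(classes)):
        ratios = [
            (spreads[a] + spreads[b]) / np.linalg.norm(centers[a] - centers[b])
            for b in range(len(classes)) if b != a
        ]
        worst.append(max(ratios))
    return float(np.mean(worst))

X = np.array([[0, 0], [0, 2], [10, 0], [10, 2]])
labels = np.array([0, 0, 1, 1])
db = davies_bouldin(X, labels)
print(db)
```

Lower values indicate tighter, better-separated clusters, so the stated bound says that sampling changes this score by at most a relative factor of $2\epsilon$.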
[8]

Biological Marker Preservation: If genomic region $j$ is a marker for cell type $C_i$ (i.e., its average accessibility in $C_i$ is significantly higher than in other types), then its selection probability satisfies $\pi_j \ge \min\left\{1, \frac{s}{r} \cdot \frac{\|a_{C_i,j} - a_{\neg C_i,j}\|_2^2}{\sum_{\ell=1}^{d} \|a_{C_i,\ell} - a_{\neg C_i,\ell}\|_2^2}\right\}$, where $a_{C_i,j}$ denotes the average accessibility of cell type $C_i$ at region $j$.

B.2.1 Proof of Theorem B.3, Part 1. Proof ...
[9]

(7) Since $\mu_i - \mu_j = \frac{1}{|C_i|}\sum_{k \in C_i} a_k^\top - \frac{1}{|C_j|}\sum_{\ell \in C_j} a_\ell^\top$ is a linear combination of rows of $A$, it lies in the row space of $A$. Therefore:

$$(1-\epsilon)\|\mu_i - \mu_j\|_2^2 \le \|(\mu_i - \mu_j)T\|_2^2 = \|\tilde{\mu}_i - \tilde{\mu}_j\|_2^2 \le (1+\epsilon)\|\mu_i - \mu_j\|_2^2, \tag{8}$$

where the inequalities follow from applying (7) with $v = \mu_i - \mu_j$, and the equality follows from the linearity of matrix multiplication. Taking square roots ...
[10]

Similarly, under the event of (7), we have $(1-\epsilon)\sigma_j^2 \le \tilde{\sigma}_j^2 \le (1+\epsilon)\sigma_j^2$.

Published in Transactions on Machine Learning Research (05/2026)

(10) Since this two-sided bound holds simultaneously for all $i' \in C_i$,

$$\tilde{\sigma}_i^2 = \frac{1}{|C_i|}\sum_{i' \in C_i} \|(a_{i'} - \mu_i)T\|_2^2 \in \left[\frac{1}{|C_i|}\sum_{i' \in C_i} (1-\epsilon)\|a_{i'} - \mu_i\|_2^2,\ \frac{1}{|C_i|}\sum_{i' \in C_i} (1+\epsilon)\|a_{i'} - \mu_i\|_2^2\right] = \left[(1-\epsilon)\sigma_i^2,\ (1+\epsilon)\sigma_i^2\right], \tag{11}$$

where the second step follows from applying the bounds in (10) pointwise, an...
[11]

Bound on inter-class distance (from (9)): $\sqrt{1-\epsilon}\,\|\mu_i - \mu_j\|_2 \le \|\tilde{\mu}_i - \tilde{\mu}_j\|_2 \le \sqrt{1+\epsilon}\,\|\mu_i - \mu_j\|_2$.
[12]

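Bound on intra-class standard deviation (from (11) after taking square root): $\sqrt{1-\epsilon}\,\sigma_i \le \tilde{\sigma}_i \le \sqrt{1+\epsilon}\,\sigma_i$. Now, we combine these bounds to analyze $\tilde{R}_{ij} = \frac{\tilde{\sigma}_i + \tilde{\sigma}_j}{\|\tilde{\mu}_i - \tilde{\mu}_j\|_2}$.

For the upper bound of $\tilde{R}_{ij}$: $\tilde{R}_{ij} = \frac{\tilde{\sigma}_i + \tilde{\sigma}_j}{\|\tilde{\mu}_i - \tilde{\mu}_j\|_2} \le \frac{\sqrt{1+\epsilon}\,(\sigma_i + \sigma_j)}{\sqrt{1-\epsilon}\,\|\mu_i - \mu_j\|_2} = \sqrt{\frac{1+\epsilon}{1-\epsilon}} \cdot R_{ij}$.

For the lower bound of $\tilde{R}_{ij}$: $\tilde{R}_{ij} \ge \frac{\sqrt{1-\epsilon}\,(\sigma_i + \sigma_j)}{\sqrt{1+\epsilon}\,\|\mu_i - \mu_j\|_2} = \sqrt{\frac{1-\epsilon}{1+\epsilon}} \cdot R_{ij}$. As ...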
[13]

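Parameter Lipschitz continuity: For any input $x \in \mathcal{X}$ and any pair of parameter vectors $\theta_1, \theta_2$, the mapping functions satisfy $\|\mu_{\theta_1}(x) - \mu_{\theta_2}(x)\| \le L_\mu \|\theta_1 - \theta_2\|$ and $\|\sigma_{\theta_1}(x) - \sigma_{\theta_2}(x)\| \le L_\sigma \|\theta_1 - \theta_2\|$.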
[14]

Bounded variance: $\sigma_{\min} \le \sigma_\theta(x) \le \sigma_{\max}$ for all $x \in \mathcal{X}$;
[15]

Bounded mean: $\|\mu_\theta(x)\| \le M$ for all $x \in \mathcal{X}$.

Assumption C.5 (Decoder Invariance Properties). The decoder $p_\phi(x \mid z, c)$ follows a Gaussian distribution and satisfies the following properties for invariant representation learning:
[16]

Bounded interaction strength: There exists a constant $\gamma > 0$ such that for any $z \in \mathbb{R}^{d_z}$ and confounding factors $c_1, c_2$, the decoder's sensitivity to $z$ has bounded variation: $\|\nabla_z \log p_\phi(x \mid z, c_1) - \nabla_z \log p_\phi(x \mid z, c_2)\| \le \gamma \|c_1 - c_2\|$. This ensures that technical variations $c$ only moderately modulate how biological signals $z$ are decoded.
[17]

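Gradient orthogonality: Let $\mathcal{L}_{\mathrm{recon}}(\theta, \phi) = -\mathbb{E}_{x,c}\left[\mathbb{E}_{z \sim q_\theta(z \mid x)}[\log p_\phi(x \mid z, c)]\right]$. For the gradient components corresponding to $z$-dependent and $c$-dependent transformations: $\left|\mathbb{E}_{x,c,z}\left[\left\langle \frac{\partial \log p_\phi(x \mid z, c)}{\partial z}, \frac{\partial \log p_\phi(x \mid z, c)}{\partial c} \right\rangle\right]\right| \le \kappa \sqrt{\mathbb{E}\left[\left\|\frac{\partial \log p_\phi}{\partial z}\right\|^2\right] \cdot \mathbb{E}\left[\left\|\frac{\partial \log p_\phi}{\partial c}\right\|^2\right]}$ for a small constant $\kappa \in (0, 1)$.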
[18]

Confounding factor necessity: The decoder requires confounding information for accurate reconstruction: there exists $\Delta > 0$ such that $\inf_{z} \sup_{c_1 \ne c_2} D_{\mathrm{KL}}\left(p_\phi(x \mid z, c_1) \,\|\, p_\phi(x \mid z, c_2)\right) \ge \Delta$.

D End-to-End Convergence Analysis of FL-Sailer in High-Dimensional Spaces

This appendix provides the detailed proofs and technical verifications referenced in Section 5. We ...
[19]

$(a, b)$-semi-smoothness with $a = O\left(\sqrt{1+\lambda}\,(L_{\mathrm{enc}} + L_{\mathrm{dec}})\right)$ and $b = O\left((1+\lambda)(L_{\mathrm{enc}} + L_{\mathrm{dec}})^2\right)$
[20]

$(\tau_1, \tau_2)$-non-critical gradient in the region of interest for optimization
[21]

$(\alpha, \beta)$-semi-Lipschitz with $\beta^2 = O\left((1+\lambda)^2 (L_{\mathrm{enc}} + L_{\mathrm{dec}})^4\right)$ and $\alpha^2 = O\left((1+\lambda)^{3/2} (L_{\mathrm{enc}} + L_{\mathrm{dec}})^4\right)$.

Proof Sketch. Given Assumptions C.4 and C.5, which provide Lipschitz constants $L_{\mathrm{enc}}$ and $L_{\mathrm{dec}}$ for the encoder and decoder networks respectively, we can characterize the behavior of each loss component. For networks with piecewise smooth activation functions, the compo...
[22]

(35) Following the theoretical framework for neural network gradients (Li et al., 2022), each component satisfies semi-Lipschitz properties. Let $G_{\max,i} = \max\{\mathcal{L}_i(W)^{1/2}, \mathcal{L}_i(U)^{1/2}\}$ for $i \in \{\mathrm{prior}, \mathrm{marginal}, \mathrm{recon}\}$.

For the prior KL term: $\|\nabla \mathcal{L}_{\mathrm{prior}}(W) - \nabla \mathcal{L}_{\mathrm{prior}}(U)\|_2^2 \le \beta_{\mathrm{prior}}^2 \|W - U\|_2^2 + \alpha_{\mathrm{prior}}^2$...
[23]

(39) Let $A = U \Sigma V^\top$ be the thin SVD of $A$, where $V \in \mathbb{R}^{d \times r}$ contains orthonormal columns that span $\operatorname{row}(A)$. We define the row-space lifting matrix $R_{\mathrm{lift}} \in \mathbb{R}^{d \times s}$ as

$$R_{\mathrm{lift}} := V (T^\top V)^{\dagger}. \tag{40}$$

Under the event $\mathcal{E}$, the matrix $T^\top V \in \mathbb{R}^{s \times r}$ has full column rank $r$, implying that $(T^\top V)^{\dagger}(T^\top V) = I_r$. Co...
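The lifting matrix has a simple numerical reading: when $T^\top V$ has full column rank, a vector in the row space of $A$ can be recovered exactly from its sampled coordinates. The sketch below checks both the left-inverse identity and the recovery on random low-rank data. The dimensions, the random matrix, and the unscaled column-selection form of $T$ are assumptions for illustration, and the excerpt is truncated before the authors state how they use this property.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r, s = 30, 20, 4, 8

A = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))   # rank r
_, _, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt[:r].T                                # d x r, orthonormal basis of row(A)

cols = rng.choice(d, size=s, replace=False)
T = np.zeros((d, s))
T[cols, np.arange(s)] = 1.0                 # d x s column selection

M = T.T @ V                                 # s x r, assumed full column rank
R_lift = V @ np.linalg.pinv(M)              # d x s lifting matrix

x = A[0]                                    # a vector in row(A)
x_lifted = R_lift @ (T.T @ x)               # sample s coordinates, then lift back
left_inverse_ok = np.allclose(np.linalg.pinv(M) @ M, np.eye(r))
recovery_ok = np.allclose(x_lifted, x)
print(left_inverse_ok, recovery_ok)
```

The recovery works because $x = Vc$ for some coefficient vector $c$, so $R_{\mathrm{lift}} T^\top x = V (T^\top V)^{\dagger} (T^\top V) c = Vc = x$.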
[24]

(52) This implies that the mapping $\psi(x) = xT$ is a bi-Lipschitz embedding from the original data manifold to the sampled space with distortion at most $\kappa(\epsilon) \approx \sqrt{1+\epsilon}$.

Proof. This follows directly from Lemma 4.1. Since $x, y \in \operatorname{row}(A)$, their difference $v = x - y$ also lies in $\operatorname{row}(A)$. Applying the subspace embedding property: $(1-\epsilon)\|v\|_2^2 \le \|vT\|_2^2 \le (1+\epsilon)\|v\|_2^2$. Step 2: R...
[25]

$(\tilde{a}, \tilde{b})$-semi-smoothness with $\tilde{a} = O(\tilde{L}_{\mathrm{enc}})$ and $\tilde{b} = O(\tilde{L}_{\mathrm{enc}}^2)$
[26]

$(\tilde{\alpha}, \tilde{\beta})$-semi-Lipschitz continuity;
[27]

minority class collapse

$(\tilde{\tau}_1, \tilde{\tau}_2)$-non-critical point condition. Therefore, the sampled optimization problem satisfies all assumptions of the FedAvg convergence theorem (Theorem D.1). Applying Theorem D.1 to $\tilde{\mathcal{L}}$ yields

$$\mathbb{E}\left[\tilde{\mathcal{L}}(\tilde{U}^{(R)})\right] - \tilde{\mathcal{L}}(\tilde{U}^*) \le (1-\lambda_1)^R \Delta_0 + 2\lambda_2, \tag{53}$$

where $\lambda_1, \lambda_2$ are defined as in Theorem D.1 using the sampled constants. Under Assumption D.7, this yields the stan...
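Reading the bound in this excerpt as a geometrically decaying term plus a constant floor, that is, $(1-\lambda_1)^R \Delta_0 + 2\lambda_2$ with $R$ the number of communication rounds, gives a quick way to estimate how many rounds are needed before the decaying term becomes negligible. The exponent reading is an interpretation of the flattened formula, and every constant below is hypothetical; the paper's actual values are not given on this page.

```python
import math

def rounds_for_tolerance(lambda1, delta0, tol):
    """Smallest R with (1 - lambda1)**R * delta0 <= tol, for 0 < lambda1 < 1."""
    if not 0.0 < lambda1 < 1.0:
        raise ValueError("lambda1 must lie in (0, 1)")
    if delta0 <= tol:
        return 0
    return math.ceil(math.log(tol / delta0) / math.log(1.0 - lambda1))

def bound(lambda1, lambda2, delta0, rounds):
    """Right-hand side of the bound: (1 - lambda1)**R * delta0 + 2 * lambda2."""
    return (1.0 - lambda1) ** rounds * delta0 + 2.0 * lambda2

# Hypothetical constants chosen only to make the arithmetic visible.
lambda1, lambda2, delta0, tol = 0.1, 0.01, 10.0, 1e-3
R = rounds_for_tolerance(lambda1, delta0, tol)
print(R, bound(lambda1, lambda2, delta0, R))
```

Under this reading, additional rounds cannot push the bound below $2\lambda_2$, so that term sets the accuracy floor regardless of how long training runs.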
