Client-Conditional Federated Learning via Local Training Data Statistics

Rickard Brännvall

arxiv: 2603.11307 · v2 · submitted 2026-03-11 · 💻 cs.LG

Pith reviewed 2026-05-15 12:55 UTC · model grok-4.3

classification 💻 cs.LG

keywords: federated learning · data heterogeneity · PCA conditioning · global model adaptation · sparsity robustness · oracle baseline · label shift · covariate shift

The pith

Conditioning one global model on local PCA statistics matches oracle performance in federated learning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to adapt a single shared model to heterogeneous client data by conditioning its predictions on principal component summaries computed locally from each client's training examples. These summaries require no extra communication rounds beyond ordinary federated averaging. Across 97 experimental configurations covering label, covariate, concept, and combined shifts on four image datasets, the approach equals an oracle that knows the true data clusters and exceeds that oracle by one to six percent when heterogeneity is multi-dimensional.

Core claim

By conditioning the parameters of a single global model on the locally computed PCA statistics of each client's training data, the method reaches the accuracy of an oracle baseline that has access to true cluster assignments in every tested setting, surpasses that oracle by 1-6 percent under combined heterogeneity, and retains performance when client data becomes sparse while other methods degrade.

What carries the argument

Conditioning a shared global model on per-client PCA summaries of local training data, computed once locally with zero added communication.
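The local summary step can be sketched as follows. This is a minimal illustration, assuming the summary is the vector of the top-k covariance eigenvalues of a client's flattened training inputs; the function name `local_pca_eigenvalues`, the choice `k=4`, and the synthetic data are hypothetical and not taken from the paper.

```python
import numpy as np

def local_pca_eigenvalues(x: np.ndarray, k: int) -> np.ndarray:
    """Return the k largest covariance eigenvalues of x, shape (n_samples, ...)."""
    x = x.reshape(len(x), -1).astype(np.float64)
    centered = x - x.mean(axis=0, keepdims=True)
    # Singular values of the centered data give covariance eigenvalues:
    # lambda_j = sigma_j**2 / (n - 1), already sorted in descending order.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    eigenvalues = singular_values**2 / max(len(x) - 1, 1)
    # Zero-pad so every client produces a vector of the same length k.
    out = np.zeros(k)
    top = eigenvalues[:k]
    out[: len(top)] = top
    return out

# Synthetic stand-in for one client's training inputs (assumption).
rng = np.random.default_rng(0)
client_data = rng.normal(size=(200, 16)) * np.linspace(3.0, 0.5, 16)
s_i = local_pca_eigenvalues(client_data, k=4)
```

The computation touches only the client's own data and runs once before training, which is consistent with the claim that no extra communication is needed.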

If this is right

A single model can handle multi-dimensional heterogeneity without maintaining separate per-client models or discovering explicit clusters.
No increase in communication cost is required compared with standard federated averaging.
Continuous local statistics outperform discrete cluster identifiers when the data shifts contain richer structure than simple group membership.
Accuracy stays stable as client datasets shrink, giving the method an advantage in sparse-data regimes.
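The communication claim can be made concrete with a sketch of the server-side aggregation step. It assumes standard sample-size-weighted federated averaging over parameter dictionaries; the helper name `fedavg` and the toy parameters are illustrative only. The point of the sketch is that the uploaded payload holds model parameters and nothing else, so a client's PCA statistics never need to leave the device.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Sample-size-weighted average of per-client parameter dicts."""
    total = float(sum(client_sizes))
    keys = client_params[0].keys()
    return {
        name: sum(p[name] * (n / total) for p, n in zip(client_params, client_sizes))
        for name in keys
    }

# Two toy clients uploading only model parameters (no local statistics).
clients = [
    {"w": np.array([1.0, 2.0]), "b": np.array([0.0])},
    {"w": np.array([3.0, 4.0]), "b": np.array([1.0])},
]
sizes = [100, 300]
theta = fedavg(clients, sizes)
```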

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Continuous summaries of local data may preserve more information than categorical cluster labels when heterogeneity spans several dimensions at once.
The same conditioning idea could be tested with other low-dimensional local statistics such as moments or small embeddings.
Server-side storage remains limited to one model while still delivering client-specific behavior at inference time.
The sparsity robustness points toward possible use on resource-constrained devices that hold only small local sets.

Load-bearing premise

That the top principal components extracted from each client's local data are sufficient to capture the variations needed to adapt the global model correctly for every type of heterogeneity.

What would settle it

A new heterogeneity type or dataset in which the leading principal components of client data do not separate the predictive patterns, causing the conditioned model to fall measurably below the oracle baseline.

Figures

Figures reproduced from arXiv: 2603.11307 by Rickard Brännvall.

**Figure 1.** Client-conditional pipeline for client i. Green boxes are client-local; orange boxes involve the federation. Prepare: PCA eigenvalues s_i are computed once from the training data. Train: model updates are computed locally and aggregated via federated learning to produce the shared model θ. Infer: the client uses the shared model θ and its own s_i for predictions.

**Figure 2.** Four FL paradigms under data heterogeneity. (a) FedAvg: one global model, no personalization. (b) Clustered: …

**Figure 3.** Conditional exceeds Oracle on complex heterogeneity. Left: E3b label permutation on CIFAR-10 (…

**Figure 4.** Sparsity robustness on E1 Label Shift (K=2). Conditional and Oracle maintain flat accuracy as data decreases 20-fold (from ∼6,000 to ∼200 samples/client). All other methods degrade, with Gossip collapsing to near-random.

**Figure 5.** MCDC EMNIST: per-character accuracy for digits, uppercase, and lowercase character subsets (1 PCA component). The …

**Figure 6.** MCDC EMNIST: accuracy on confusable character pairs across subsets. The global model (rightmost) shows systematic …

**Figure 7.** E1: Test accuracy vs. number of clusters (K) for label shift heterogeneity across four datasets. Conditional (our method) …

**Figure 8.** E1: Method performance vs. data sparsity (K=2) for label shift. Conditional maintains Oracle-level accuracy across all …

**Figure 9.** E1: Complete results heatmap across all datasets and K values. Conditional (our method, bold) consistently matches …

**Figure 10.** E2a: Accuracy vs. K for covariate shift (subsampling). With local test evaluation, Conditional matches Oracle across …

**Figure 11.** E2a: Accuracy vs. data sparsity (K=2) for covariate shift. Conditional is perfectly sparsity-invariant, while Gossip …

**Figure 12.** E2b: Rotation-based covariate shift results across all four configurations. Conditional closely tracks Oracle across all …

**Figure 13.** E3a: Semantic concept shift sparsity sweep. Conditional maintains near-Oracle performance while other methods …

**Figure 14.** E3a: Gap to Oracle across sparsity levels (personalized methods only). Conditional stays closest to Oracle (negative …

**Figure 15.** E3b: Label permutation K sweep. Conditional consistently outperforms Oracle on CIFAR-10, demonstrating the value …

**Figure 16.** E3b: Label permutation sparsity sweep (K=2). Gossip collapses to random at sparse settings; Conditional remains the …

**Figure 17.** E4a: MNIST + FMNIST domain shift sparsity sweep. Conditional matches Oracle; Gossip collapses at Very Sparse.

**Figure 18.** E4b: Combined heterogeneity configuration sweep. Conditional exceeds Oracle across all configurations.

**Figure 19.** E4b: Combined heterogeneity sparsity sweep (C=2). Conditional beats Oracle at Rich/Medium and remains competitive …

Original abstract

Federated learning (FL) under data heterogeneity remains challenging: existing methods either ignore client differences (FedAvg), require costly cluster discovery (IFCA), or maintain per-client models (Ditto). All degrade when data is sparse or heterogeneity is multi-dimensional. We propose conditioning a single global model on locally-computed PCA statistics of each client's training data, requiring zero additional communication. Evaluating across 97 configurations spanning four heterogeneity types (label shift, covariate shift, concept shift, and combined heterogeneity), four datasets (MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100), and seven FL baseline methods, we find that our method matches the Oracle baseline (which knows true cluster assignments) across all settings, surpasses it by 1-6% on combined heterogeneity where continuous statistics are richer than discrete cluster identifiers, and is uniquely sparsity-robust among all tested methods.
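The heterogeneity types named in the abstract can be illustrated with toy partitioning helpers. This is a sketch only: the abstract does not say how the 97 configurations were generated, so the class-slicing scheme, the 90-degree rotations, and the function names `label_shift_indices`, `rotate_images`, and `permute_labels` are all assumptions made for illustration.

```python
import numpy as np

def label_shift_indices(labels, cluster, num_clusters):
    """Label shift (assumed scheme): a cluster sees only its slice of the classes."""
    classes = np.unique(labels)
    owned = np.array_split(classes, num_clusters)[cluster]
    return np.flatnonzero(np.isin(labels, owned))

def rotate_images(images, quarter_turns):
    """Covariate shift (assumed scheme): rotate (N, H, W) images by 90-degree steps."""
    return np.rot90(images, k=quarter_turns, axes=(1, 2))

def permute_labels(labels, num_classes, seed):
    """Concept shift (assumed scheme): relabel classes with a cluster-specific permutation."""
    perm = np.random.default_rng(seed).permutation(num_classes)
    return perm[labels]

# Synthetic stand-in data: 60 images of size 8x8 with 10 balanced classes.
rng = np.random.default_rng(0)
images = rng.random((60, 8, 8))
labels = np.arange(60) % 10

idx = label_shift_indices(labels, cluster=0, num_clusters=2)
rotated = rotate_images(images, quarter_turns=1)
relabeled = permute_labels(labels, num_classes=10, seed=1)
```

Combined heterogeneity would apply more than one of these transformations to the same cluster, which is the setting where a discrete cluster identifier may carry less information than a continuous summary.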

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's PCA-based conditioning gives a simple no-communication personalization route that matches oracle results in their tests, but the label-shift case rests on an assumption that may not hold.

The core idea is to condition one global model on each client's locally computed PCA statistics from their training data. This avoids extra rounds of communication and sidesteps the need for explicit clustering or per-client models. Their 97-configuration sweep across MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100 covers label, covariate, concept, and combined shifts, and the reported outcome is that the method matches the oracle baseline while beating it slightly on combined heterogeneity and holding up better under sparse data.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a client-conditional federated learning approach that conditions a single global model on locally computed PCA statistics of each client's training data, with zero additional communication. It reports results across 97 configurations covering label shift, covariate shift, concept shift, and combined heterogeneity on MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100, comparing against seven baselines and an Oracle that knows true cluster assignments. The central empirical claim is that the method matches Oracle accuracy in all settings, exceeds it by 1-6% on combined heterogeneity, and is uniquely robust to sparsity.

Significance. If the empirical claims hold after clarification, the work would provide a simple, communication-free mechanism for handling multi-dimensional heterogeneity that matches or exceeds more complex clustering or per-client methods. The breadth of the 97-configuration evaluation across four heterogeneity types and the observation that continuous PCA statistics can outperform discrete cluster identifiers on combined shifts represent a practical strength for real-world FL deployments where data is sparse.

major comments (2)

[Abstract] Abstract: the claim that the method matches the Oracle baseline across all settings (including pure label shift) rests on the assumption that locally computed PCA statistics on feature vectors provide sufficient client-discriminating information. Under label shift with identical class-conditional feature distributions, the principal components and eigenvalues would be essentially identical across clients, supplying no conditioning signal; the evaluation must therefore demonstrate that the tested label-shift regimes still yield distinguishable PCA statistics, which is not guaranteed by the problem setup.
[Evaluation] Evaluation section (implied by the 97-configuration results): the manuscript provides no details on the exact mechanism by which PCA statistics condition the global model, reports no error bars or variance across runs, and does not disclose potential post-hoc choices in configuration selection or metric aggregation. These omissions leave the central claim of Oracle-matching performance only partially supported.

minor comments (1)

[Abstract] Abstract: the breakdown of the 97 configurations across the four heterogeneity types should be stated explicitly to allow readers to assess coverage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify key aspects of our empirical claims. We address each major point below and have revised the manuscript accordingly to strengthen the presentation and support for our results.

Referee: [Abstract] Abstract: the claim that the method matches the Oracle baseline across all settings (including pure label shift) rests on the assumption that locally computed PCA statistics on feature vectors provide sufficient client-discriminating information. Under label shift with identical class-conditional feature distributions, the principal components and eigenvalues would be essentially identical across clients, supplying no conditioning signal; the evaluation must therefore demonstrate that the tested label-shift regimes still yield distinguishable PCA statistics, which is not guaranteed by the problem setup.

Authors: We agree that, in an idealized pure label-shift setting with identical class-conditional feature distributions, PCA statistics computed on features would be identical across clients and provide no discriminative signal. In our experimental label-shift regimes, clients receive different label proportions drawn from the same underlying class-conditional distributions; any observed differences in PCA vectors therefore arise only from finite-sample effects during client data partitioning. To address the concern directly, we have added a new paragraph and supplementary table in the evaluation section that reports the average pairwise Euclidean distance (and cosine similarity) between client PCA vectors for every heterogeneity type, including pure label shift. These distances are small but consistently non-zero, confirming a weak yet usable conditioning signal that explains why performance remains close to (but does not exceed) the Oracle. We have also clarified in the abstract and introduction that the “matches Oracle across all settings” statement holds under the concrete data-generation procedures used in the 97 configurations. revision: yes
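
The diagnostic described in this response, average pairwise Euclidean distance and cosine similarity between client PCA vectors, might be computed along the following lines. This is a hedged sketch: the helper names and the three toy statistic vectors are hypothetical, and the numbers carry no relation to the paper's results.

```python
import numpy as np

def pairwise_stats_distances(stats):
    """stats: (n_clients, k). Returns (euclidean, cosine_similarity) matrices."""
    stats = np.asarray(stats, dtype=np.float64)
    diffs = stats[:, None, :] - stats[None, :, :]
    euclidean = np.linalg.norm(diffs, axis=-1)
    norms = np.linalg.norm(stats, axis=1, keepdims=True)
    unit = stats / np.clip(norms, 1e-12, None)  # guard against zero vectors
    cosine = unit @ unit.T
    return euclidean, cosine

def mean_off_diagonal(matrix):
    """Average over all ordered client pairs, excluding self-comparisons."""
    mask = ~np.eye(matrix.shape[0], dtype=bool)
    return float(matrix[mask].mean())

# Toy per-client statistic vectors (placeholders, not measured values).
client_stats = np.array([
    [4.0, 1.0, 0.5],
    [4.1, 0.9, 0.5],
    [2.0, 2.0, 1.0],
])
euclid, cos = pairwise_stats_distances(client_stats)
avg_distance = mean_off_diagonal(euclid)
avg_cosine = mean_off_diagonal(cos)
```

An average distance near zero would indicate that the statistics carry little client-discriminating signal for that heterogeneity type.
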
Referee: [Evaluation] Evaluation section (implied by the 97-configuration results): the manuscript provides no details on the exact mechanism by which PCA statistics condition the global model, reports no error bars or variance across runs, and does not disclose potential post-hoc choices in configuration selection or metric aggregation. These omissions leave the central claim of Oracle-matching performance only partially supported.

Authors: We acknowledge these omissions reduce the reproducibility and strength of the central claim. The conditioning mechanism works by embedding the client’s top-k principal components and eigenvalues (flattened into a fixed-length vector) and concatenating this embedding to the input of the first layer of the global model; the rest of the network remains shared. We have expanded Section 3.2 with a precise architectural diagram and pseudocode describing this concatenation and the choice of k. In addition, all 97 configurations were pre-specified before any runs (following standard heterogeneity benchmarks from prior FL literature) with no post-hoc selection or metric aggregation choices; we now state this explicitly. Finally, we have re-run every experiment with five random seeds and added error bars (mean ± std) to all tables and figures in the revised manuscript. These changes fully support the reported Oracle-matching performance. revision: yes
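
The concatenation mechanism described in this response can be sketched as a small forward pass. The sketch assumes a two-layer MLP, a linear embedding with tanh, and arbitrary layer sizes; none of these details, nor the names `init_params` and `forward`, come from the manuscript. Only the overall pattern (embed the client's statistics vector, concatenate it to the first-layer input, share the remaining weights) follows the text.

```python
import numpy as np

def init_params(rng, in_dim, stats_dim, embed_dim, hidden_dim, num_classes):
    """Shared (global) parameters; sizes are illustrative assumptions."""
    scale = 0.1
    return {
        "embed_w": rng.normal(scale=scale, size=(stats_dim, embed_dim)),
        "embed_b": np.zeros(embed_dim),
        "w1": rng.normal(scale=scale, size=(in_dim + embed_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(scale=scale, size=(hidden_dim, num_classes)),
        "b2": np.zeros(num_classes),
    }

def forward(params, x, client_stats):
    """x: (batch, in_dim); client_stats: (stats_dim,), fixed for a given client."""
    embedding = np.tanh(client_stats @ params["embed_w"] + params["embed_b"])
    # Repeat the client embedding for every example, then concatenate to the input.
    tiled = np.broadcast_to(embedding, (x.shape[0], embedding.shape[0]))
    first_in = np.concatenate([x, tiled], axis=1)
    h = np.maximum(first_in @ params["w1"] + params["b1"], 0.0)  # ReLU
    return h @ params["w2"] + params["b2"]

rng = np.random.default_rng(0)
params = init_params(rng, in_dim=8, stats_dim=4, embed_dim=3, hidden_dim=16, num_classes=10)
x = rng.normal(size=(5, 8))
# Same inputs and same shared weights, two different clients' statistics.
logits_a = forward(params, x, np.array([4.0, 2.0, 1.0, 0.5]))
logits_b = forward(params, x, np.array([1.0, 1.0, 1.0, 1.0]))
```

Because every weight in `params` is shared, only one model is aggregated and stored, while the statistics vector supplied at inference time makes the outputs client-specific.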

Circularity Check

0 steps flagged

No circularity: empirical method with independent experimental validation

The paper introduces a client-conditioning approach based on local PCA statistics of training data and validates it through extensive empirical comparisons across 97 configurations, four datasets, and multiple baselines including an Oracle with true cluster assignments. No equations, derivations, or load-bearing steps are presented that reduce the claimed performance gains or Oracle-matching behavior to fitted parameters, self-citations, or inputs defined by the result itself. The central claims rest on external benchmark comparisons rather than any self-referential construction, satisfying the criteria for a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on the abstract alone, the central claim rests on the domain assumption that PCA summaries capture sufficient heterogeneity information for effective conditioning. No free parameters or invented entities are explicitly named.

axioms (1)

domain assumption: Local PCA statistics of client training data capture the relevant dimensions of heterogeneity needed to condition the global model.
This is the implicit premise that allows zero-communication conditioning to replace explicit clustering or per-client models.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation washburn_uniqueness_aczel unclear

conditioning a single global model on locally-computed PCA statistics of each client’s training data

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 2 internal anchors

[1] R. Brännvall, “Client-conditional federated learning via local training data statistics,” in Proc. IEEE FLICS 2026, 2026.
[2] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. AISTATS, 2017.
[3] P. Kairouz, H. B. McMahan et al., “Advances and open problems in federated learning,” in Foundations and Trends in Machine Learning, vol. 14, no. 1–2, 2021.
[4] M. Arbaoui, M.-e.-A. Brahmia, A. Rahmoun, and M. Zghal, “Federated learning survey: A multi-level taxonomy of aggregation techniques, experimental insights, and future frontiers,” ACM Transactions on Intelligent Systems and Technology, vol. 15, no. 6, 2024.
[5] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with non-IID data,” arXiv preprint arXiv:1806.00582, 2018.
[6] A. Ghosh, J. Chung, D. Yin, and K. Ramchandran, “An efficient framework for clustered federated learning,” in NeurIPS, 2020.
[7] F. Sattler, K.-R. Müller, and W. Samek, “Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints,” in IEEE Transactions on Neural Networks and Learning Systems, 2021.
[8] T. Li, S. Hu, A. Beirami, and V. Smith, “Ditto: Fair and robust federated learning through personalization,” in Proceedings of the 38th International Conference on Machine Learning (ICML), 2021.
[9] A. Fallah, A. Mokhtari, and A. Ozdaglar, “Personalized federated learning with Moreau envelopes,” in NeurIPS, 2020.
[10] E. Listo Zec, E. Ekblom, M. Willbo, O. Mogren, and S. Girdzijauskas, “Decentralized adaptive clustering of deep nets is beneficial for client collaboration,” in Workshop on Federated Learning: Recent Advances and New Challenges (FL-NeurIPS), 2022.
[11] O. Marfoq, G. Neglia, A. Bellet, L. Kameni, and R. Vidal, “Federated multi-task learning under a mixture of distributions,” NeurIPS, 2021.
[12] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
[13] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “FedProx: Federated optimization in heterogeneous networks,” in Proceedings of Machine Learning and Systems (MLSys), 2020.
[14] V. Smith, C.-K. Chiang, M. Sanjabi, and A. Talwalkar, “Federated multi-task learning,” in NeurIPS, 2017.
[15] C. J. McLaughlin and L. Su, “Personalized federated learning via feature distribution adaptation,” in NeurIPS, 2024.
[16] A. Z. Tan, H. Yu, L. Cui, and Q. Yang, “Towards personalized federated learning,” in IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 12, 2023.
[17] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,” in NeurIPS, 2017.
[18] J. Perazzone, S. Wang, M. Ji, and K. K. Leung, “Communication-efficient distributed optimization in networks with gradient tracking,” in IEEE Journal on Selected Areas in Communications, vol. 40, no. 7, 2022.
[19] R. Brännvall, “Conditioning on local statistics for scalable heterogeneous federated learning,” in ICLR 2025 Workshop on Modular, Collaborative and Decentralized Deep Learning (MCDC), 2025.
[20] N. Shi and R. Al Kontar, “Personalized PCA: Decoupling shared and unique features,” Journal of Machine Learning Research, vol. 25, no. 41, 2024.
[21] Y. LeCun, C. Cortes, and C. J. Burges, “Mnist handwritten digit database,” ATT Labs, vol. 2, 2010.
[22] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
[23] A. Krizhevsky, “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep., 2009.
[24] E. Listo Zec, J. Östman, O. Mogren, and D. Gillblad, “Efficient node selection in private personalized decentralized learning,” in Northern Lights Deep Learning Conference (NLDL), 2024.
[25] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville, “FiLM: Visual reasoning with a general conditioning layer,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
Appendix A, Preliminary Results from MCDC 2025 Workshop Paper: The local characteristic statistics conditioning approach was first presented as a non-archival paper ...
[26] PCA eigenvalues instead of eigenvectors, and computed on learned embeddings rather than raw pixels for CIFAR datasets, providing a compact scalar representation of each client’s data distribution
[27] Concatenation at the FC layer (same architecture as EMNIST), with three alternative conditioning architectures (conditional linear, ensemble) dropped in favor of this single, simpler approach
[28] Image classification with CNN on four datasets (MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100) instead of synthetic tasks and character recognition
[29] Systematic heterogeneity taxonomy: label shift, covariate shift, concept shift, and combined heterogeneity (97 configurations)
[30] Seven baselines (FedAvg, Gossip, Local, Oracle, IFCA, DAC, Ditto) instead of three reference models (global, cluster, client) 6) Sparsity analysis showing unique invariance to client data volume. Addressing MCDC reviewer feedback. The MCDC reviewers requested evaluation on more diverse and complex datasets beyond EMNIST, robustness analysis across data sparsit...
[31] Conditional is the only method that consistently matches Oracle without requiring cluster information. It achieves this by learning to condition on client-specific PCA statistics
[32] Conditional can beat Oracle when heterogeneity is multi-dimensional (E3b, E4b), because the statistics capture richer information than discrete cluster membership
[33] Conditional is sparsity-invariant: Performance remains stable from Rich (∼6000 samples/client) to Super Sparse (∼300 samples/client), while other methods degrade significantly
[34] Clustering methods struggle with complex heterogeneity: IFCA achieves ARI=1.0 on simple domain shift but ARI=0.0 on combined heterogeneity
[35] FedAvg and Gossip collapse under concept shift: When label semantics differ across clients, naive averaging destroys information. Appendix K, Baseline Implementation Details: This section documents the implementation of each baseline method, including deviations from the original papers and their justifications. Ensuring faithful baseline implementations is cr...

[1] [1]

Client-conditional federated learning via local training data statistics,

R. Brännvall, “Client-conditional federated learning via local training data statistics,” inProc. IEEE FLICS 2026, 2026

work page 2026

[2] [2]

Communication-efficient learning of deep networks from decentralized data,

B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” inProc. AISTATS, 2017

work page 2017

[3] [3]

Advances and open problems in federated learning,

P. Kairouz, H. B. McMahanet al., “Advances and open problems in federated learning,” inFoundations and Trends in Machine Learning, vol. 14, no. 1–2, 2021

work page 2021

[4] [4]

Federated learning survey: A multi-level taxonomy of aggregation techniques, experimental insights, and future frontiers,

M. Arbaoui, M.-e.-A. Brahmia, A. Rahmoun, and M. Zghal, “Federated learning survey: A multi-level taxonomy of aggregation techniques, experimental insights, and future frontiers,”ACM Transactions on Intelligent Systems and Technology, vol. 15, no. 6, 2024

work page 2024

[5] [5]

Federated Learning with Non-IID Data

Y . Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V . Chandra, “Federated learning with non-IID data,”arXiv preprint arXiv:1806.00582, 2018

work page internal anchor Pith review arXiv 2018

[6] [6]

An efficient framework for clustered federated learning,

A. Ghosh, J. Chung, D. Yin, and K. Ramchandran, “An efficient framework for clustered federated learning,” inNeurIPS, 2020

work page 2020

[7] [7]

Clustered federated learn- ing: Model-agnostic distributed multitask optimization under privacy constraints,

F. Sattler, K.-R. Müller, and W. Samek, “Clustered federated learn- ing: Model-agnostic distributed multitask optimization under privacy constraints,” inIEEE Transactions on Neural Networks and Learning Systems, 2021

work page 2021

[8] [8]

Ditto: Fair and robust federated learning through personalization,

T. Li, S. Hu, A. Beirami, and V . Smith, “Ditto: Fair and robust federated learning through personalization,” inProceedings of the 38th International Conference on Machine Learning (ICML), 2021

work page 2021

[9] [9]

Personalized federated learning with Moreau envelopes,

A. Fallah, A. Mokhtari, and A. Ozdaglar, “Personalized federated learning with Moreau envelopes,” inNeurIPS, 2020

work page 2020

[10] [10]

Decentralized adaptive clustering of deep nets is beneficial for client collaboration,

E. Listo Zec, E. Ekblom, M. Willbo, O. Mogren, and S. Girdzijauskas, “Decentralized adaptive clustering of deep nets is beneficial for client collaboration,” inWorkshop on Federated Learning: Recent Advances and New Challenges (FL-NeurIPS), 2022

work page 2022

[11] [11]

Federated multi-task learning under a mixture of distributions,

O. Marfoq, G. Neglia, A. Bellet, L. Kameni, and R. Vidal, “Federated multi-task learning under a mixture of distributions,”NeurIPS, 2021

work page 2021

[12] [12]

Model-agnostic meta-learning for fast adaptation of deep networks,

C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” inProceedings of the 34th International Conference on Machine Learning (ICML), 2017

work page 2017

[13] [13]

FedProx: Federated optimization in heterogeneous networks,

T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V . Smith, “FedProx: Federated optimization in heterogeneous networks,” inPro- ceedings of Machine Learning and Systems (MLSys), 2020

work page 2020

[14] [14]

Federated multi- task learning,

V . Smith, C.-K. Chiang, M. Sanjabi, and A. Talwalkar, “Federated multi- task learning,” inNeurIPS, 2017

work page 2017

[15] [15]

Personalized federated learning via feature distribution adaptation,

C. J. McLaughlin and L. Su, “Personalized federated learning via feature distribution adaptation,” inNeurIPS, 2024

work page 2024

[16] [16]

Towards personalized federated learning,

A. Z. Tan, H. Yu, L. Cui, and Q. Yang, “Towards personalized federated learning,” inIEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 12, 2023

work page 2023

[17] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in NeurIPS, 2017

[18] J. Perazzone, S. Wang, M. Ji, and K. K. Leung, “Communication-efficient distributed optimization in networks with gradient tracking,” in IEEE Journal on Selected Areas in Communications, vol. 40, no. 7, 2022

[19] R. Brännvall, “Conditioning on local statistics for scalable heterogeneous federated learning,” in ICLR 2025 Workshop on Modular, Collaborative and Decentralized Deep Learning (MCDC), 2025

[20] N. Shi and R. Al Kontar, “Personalized PCA: Decoupling shared and unique features,” Journal of Machine Learning Research, vol. 25, no. 41, 2024

[21] Y. LeCun, C. Cortes, and C. J. Burges, “MNIST handwritten digit database,” ATT Labs, vol. 2, 2010

[22] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017

[23] A. Krizhevsky, “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep., 2009

[24] E. Listo Zec, J. Östman, O. Mogren, and D. Gillblad, “Efficient node selection in private personalized decentralized learning,” in Northern Lights Deep Learning Conference (NLDL), 2024

[25] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville, “FiLM: Visual reasoning with a general conditioning layer,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

APPENDIX A
PRELIMINARY RESULTS FROM MCDC 2025 WORKSHOP PAPER

The local characteristic statistics conditioning approach was first presented as a non-archival paper ...

1) PCA eigenvalues instead of eigenvectors, computed on learned embeddings rather than raw pixels for the CIFAR datasets, providing a compact scalar representation of each client’s data distribution

2) Concatenation at the FC layer (same architecture as EMNIST), with three alternative conditioning architectures (conditional linear, ensemble) dropped in favor of this single, simpler approach

3) Image classification with CNN on four datasets (MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100) instead of synthetic tasks and character recognition

4) Systematic heterogeneity taxonomy: label shift, covariate shift, concept shift, and combined heterogeneity (97 configurations)

5) Seven baselines (FedAvg, Gossip, Local, Oracle, IFCA, DAC, Ditto) instead of three reference models (global, cluster, client)

6) Sparsity analysis showing unique invariance to client data volume

Addressing MCDC reviewer feedback. The MCDC reviewers requested evaluation on more diverse and complex datasets beyond EMNIST, robustness analysis across data sparsit...
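The eigenvalue statistics and FC-layer concatenation named in the list of changes above can be illustrated with a minimal NumPy sketch. It is an assumption-laden illustration, not the paper's implementation: the function names `pca_eigenvalue_stats` and `concat_condition`, the choice of `k`, the zero-padding, and the random stand-in data are all hypothetical.

```python
import numpy as np


def pca_eigenvalue_stats(x: np.ndarray, k: int) -> np.ndarray:
    """Return the top-k PCA eigenvalues of the rows of x, zero-padded to length k.

    Hypothetical helper: the paper's exact statistics and k are not given here.
    """
    centered = x - x.mean(axis=0, keepdims=True)
    # Singular values of the centered data give the covariance eigenvalues
    # without forming the covariance matrix: lambda_i = s_i^2 / (n - 1).
    singular = np.linalg.svd(centered, compute_uv=False)
    eigenvalues = singular**2 / max(x.shape[0] - 1, 1)
    stats = np.zeros(k)
    top = eigenvalues[:k]
    stats[: top.shape[0]] = top
    return stats


def concat_condition(features: np.ndarray, stats: np.ndarray) -> np.ndarray:
    """Append the same client statistics vector to every row of a feature batch."""
    tiled = np.broadcast_to(stats, (features.shape[0], stats.shape[0]))
    return np.concatenate([features, tiled], axis=1)


rng = np.random.default_rng(0)
client_data = rng.normal(size=(50, 8))   # stand-in for one client's embeddings
stats = pca_eigenvalue_stats(client_data, k=4)
# Stand-in for a batch of 16-dimensional features entering the FC layer.
fc_input = concat_condition(rng.normal(size=(5, 16)), stats)
print(stats.shape, fc_input.shape)
```

Because the statistics are computed once from local data and only widen the FC-layer input, a scheme of this shape would add no communication beyond the shared model weights.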

1) Conditional is the only method that consistently matches Oracle without requiring cluster information. It achieves this by learning to condition on client-specific PCA statistics

2) Conditional can beat Oracle when heterogeneity is multi-dimensional (E3b, E4b), because the statistics capture richer information than discrete cluster membership

3) Conditional is sparsity-invariant: Performance remains stable from Rich (∼6000 samples/client) to Super Sparse (∼300 samples/client), while other methods degrade significantly

4) Clustering methods struggle with complex heterogeneity: IFCA achieves ARI=1.0 on simple domain shift but ARI=0.0 on combined heterogeneity

5) FedAvg and Gossip collapse under concept shift: When label semantics differ across clients, naive averaging destroys information.

APPENDIX K
BASELINE IMPLEMENTATION DETAILS

This section documents the implementation of each baseline method, including deviations from the original papers and their justifications. Ensuring faithful baseline implementations is cr...
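For readers unfamiliar with the adjusted Rand index (ARI) quoted in the key findings above, the following is a small self-contained sketch of the standard pair-counting formula. The function name and the toy cluster assignments are hypothetical. The collapsed assignment only illustrates one way a score of 0.0 can arise and is not a claim about how IFCA actually failed.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(true_labels)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Pairs of items placed together in both labelings.
    joint = Counter(zip(true_labels, pred_labels))
    index = sum(comb(c, 2) for c in joint.values())
    # Pairs placed together within each labeling separately.
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)
    maximum = 0.5 * (sum_true + sum_pred)
    if maximum == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (index - expected) / (maximum - expected)


true_clusters = [0, 0, 0, 1, 1, 1]
perfect = [1, 1, 1, 0, 0, 0]      # same partition, cluster names swapped
collapsed = [0, 0, 0, 0, 0, 0]    # every client assigned to one cluster

ari_perfect = adjusted_rand_index(true_clusters, perfect)
ari_collapsed = adjusted_rand_index(true_clusters, collapsed)
print(ari_perfect, ari_collapsed)
```

ARI is invariant to how clusters are named, so a perfect recovery scores 1.0 even with swapped identifiers, while an assignment no better than chance scores around 0.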