pith. machine review for the scientific record.

arxiv: 2605.07962 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.DC

Recognition: 1 theorem link · Lean Theorem

FLAM: Evaluating Model Performance with Aggregatable Measures in Federated Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:33 UTC · model grok-4.3

classification 💻 cs.LG cs.DC
keywords federated learning · performance evaluation · aggregatable measures · centralized evaluation · machine learning · model performance · FLAM

The pith

FLAM lets federated learning evaluate global model performance exactly as if all test data were in one place.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Federated learning keeps data local, forcing coordinators to aggregate metrics computed separately on each participant's data. Common aggregation methods, such as weighting local metrics by sample count, do not always match the result of evaluating on a single combined test set. The paper identifies the causes of these mismatches for general metrics and proposes FLAM, which uses specially defined aggregatable measures instead. This approach produces values identical to centralized evaluation while respecting data privacy and avoiding the collection of a global test set.
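
To make the mismatch concrete, here is a toy illustration (ours, not the paper's): with two hypothetical clients, the sample-weighted average of local precisions disagrees with the precision computed from the pooled counts.

```python
# Toy illustration (not from the paper): sample-weighted averaging of a
# ratio metric versus pooled evaluation. Centralized precision is a ratio
# of summed counts, not a weighted average of per-client ratios, so the
# two agree only in special cases.

# Hypothetical per-client confusion counts: true positives, false
# positives, and local test-set size.
clients = [
    {"tp": 9, "fp": 1, "n": 100},   # local precision 0.90
    {"tp": 10, "fp": 10, "n": 50},  # local precision 0.50
]

total_n = sum(c["n"] for c in clients)

# Common FL practice: weight each local precision by sample count.
weighted = sum(c["n"] / total_n * c["tp"] / (c["tp"] + c["fp"]) for c in clients)

# Centralized evaluation on the pooled counts.
pooled = sum(c["tp"] for c in clients) / sum(c["tp"] + c["fp"] for c in clients)

print(f"weighted average of local precisions: {weighted:.4f}")  # 0.7667
print(f"pooled (centralized) precision:       {pooled:.4f}")    # 0.6333
```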

Core claim

The authors propose FLAM, a performance evaluation method based on aggregatable measures that yields the same results as centralized evaluation without the need for a global test dataset. By redefining how measures are computed locally, the aggregated value exactly equals what would be obtained if all data were centralized, eliminating inconsistencies that arise with standard approaches.

What carries the argument

FLAM, the method that employs aggregatable measures to ensure local evaluations combine to match centralized performance scores exactly.

If this is right

  • Global performance can be assessed accurately using only local computations and aggregates.
  • The method applies to a wide range of performance metrics, not limited to accuracy.
  • Evaluation results support deployment decisions that align with centralized benchmarks.
  • Privacy is maintained since no raw test data needs to be shared or centralized.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This framework might extend to other evaluation tasks in distributed systems where global consistency is required.
  • It could improve the fairness of comparisons between different federated learning algorithms.
  • Researchers may develop libraries of pre-defined aggregatable versions for popular metrics.

Load-bearing premise

Performance measures can always be redefined so that they aggregate exactly to the centralized value for any data partition, without assumptions about the distribution of data or model outputs.

What would settle it

Split a dataset into client partitions, train a model in federated fashion, apply FLAM to compute a metric such as the F1 score, and verify whether it equals the F1 score computed on the full pooled test data; a mismatch would show the method does not achieve exact equivalence.
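
A minimal sketch of that check, with synthetic labels, a simulated predictor, and hand-rolled count aggregation standing in for the paper's actual measure definitions:

```python
# Minimal sketch of the falsification test (synthetic data; the paper's
# aggregatable-measure definitions would replace the hand-rolled counts).
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
# A noisy stand-in for a trained global model: 80% correct predictions.
y_pred = np.where(rng.random(1000) < 0.8, y_true, 1 - y_true)

# Arbitrary uneven partition of the test set across four clients.
splits = np.split(np.arange(1000), [150, 400, 700])

# Each client reports only summable counts, never raw labels or predictions.
tp = fp = fn = 0
for idx in splits:
    t, p = y_true[idx], y_pred[idx]
    tp += int(np.sum((p == 1) & (t == 1)))
    fp += int(np.sum((p == 1) & (t == 0)))
    fn += int(np.sum((p == 0) & (t == 1)))

f1_aggregated = 2 * tp / (2 * tp + fp + fn)  # from aggregated counts
f1_centralized = f1_score(y_true, y_pred)    # on the pooled test set
assert abs(f1_aggregated - f1_centralized) < 1e-12
```

If the two numbers ever diverge beyond floating-point error, the exact-equivalence claim fails for that metric.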

Figures

Figures reproduced from arXiv: 2605.07962 by Christian Zirpins, David Bermbach, Fabian Stricker, Jose A. Peregrina.

Figure 1. PVOD: Comparison of aggregated metrics against …

Figure 2. CIFAR-10 MS: Comparison of aggregated metrics …

Figure 4. PVOD: Comparison of FLAM, coordinator-side calcu…

Figure 6. Covertype LS with α = 0.6: Comparison of FLAM against coordinator-side calculated metrics. Fig. 6a and Fig. 6b depict, for all metrics, that they achieved the same results as the centralized evaluation.

Figure 5. CIFAR-10 MS: Comparison of FLAM against coordinator-side calculated metrics.
original abstract

Performance evaluation is essential for assessing the quality of machine learning (ML) models and guiding deployment decisions. In federated learning (FL), assessing the performance is challenging because data are distributed across participants. Consequently, the coordinator must rely on locally computed evaluation metrics and aggregate them to assess the global model. A key challenge is that common aggregation strategies, such as weighted averaging based on the local samples per participant, do not always produce the same results as centralized evaluation. Existing definitions of performance evaluation are largely tailored to accuracy and do not generalize to other metrics, leading to inconsistencies between participant-based and centralized evaluation. However, such discrepancies are inconsistent with the FL objective and lead to a wrong calculation of the metric. To address this issue, we examine the underlying reasons for these discrepancies and propose FLAM, a performance evaluation method based on aggregatable measures that yields the same results as centralized evaluation without the need for a global test dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper identifies that standard aggregation strategies (e.g., weighted averaging of local metrics) in federated learning often fail to match the results of centralized evaluation on pooled data, particularly for metrics beyond accuracy. It proposes FLAM, a performance evaluation method based on aggregatable measures that is claimed to recover exactly the centralized metric values without requiring participants to share a global test dataset.

Significance. If the central claim holds and FLAM provides exact equivalence for arbitrary performance metrics via local aggregates alone, the work would be significant for practical FL deployments: it would allow coordinators to obtain reliable, privacy-preserving global performance numbers that align with the FL objective of training on distributed data. The approach could reduce inconsistencies that currently lead to incorrect model assessment.

major comments (2)
  1. [Abstract] The claim that FLAM 'yields the same results as centralized evaluation' for general metrics is not accompanied by a formal definition of 'aggregatable measures' or a derivation showing how local statistics combine to recover the global value; without these, the generality asserted in the abstract cannot be verified.
  2. [Abstract] The method is presented as addressing discrepancies for 'other metrics' beyond accuracy, yet the objection that non-additive metrics (AUC-ROC, average precision) depend on global ordering or thresholds rather than summable counts is not addressed; if FLAM relies only on summable local statistics, the central claim of equivalence for arbitrary metrics is at risk.
minor comments (1)
  1. [Abstract] The abstract would benefit from a single concrete example (e.g., how precision or F1 is expressed as aggregatable measures) to illustrate the proposed solution.
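
For concreteness, such an example might read as follows (our notation and illustration, not the paper's): each client k reports the summable statistics (TP_k, FP_k, FN_k), and the coordinator recomputes the ratio from the sums rather than averaging local ratios.

```latex
% Illustration (our notation, not the paper's): precision as an
% aggregatable measure versus sample-weighted averaging.
\mathrm{Prec}_{\mathrm{central}}
  = \frac{\sum_k \mathrm{TP}_k}{\sum_k \mathrm{TP}_k + \sum_k \mathrm{FP}_k}
  \;\ne\;
  \sum_k \frac{n_k}{n} \cdot \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FP}_k},
\qquad
\mathrm{F1}_{\mathrm{central}}
  = \frac{2\sum_k \mathrm{TP}_k}{2\sum_k \mathrm{TP}_k + \sum_k \mathrm{FP}_k + \sum_k \mathrm{FN}_k}.
```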

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the manuscript. We address each major comment below with clarifications on the formal foundations and intended scope of FLAM, while noting where minor revisions to the abstract can improve clarity.

point-by-point responses
  1. Referee: [Abstract] The claim that FLAM 'yields the same results as centralized evaluation' for general metrics is not accompanied by a formal definition of 'aggregatable measures' or a derivation showing how local statistics combine to recover the global value; without these, the generality asserted in the abstract cannot be verified.

    Authors: The abstract is intentionally concise and summarizes the contribution without including full technical details. The formal definition of aggregatable measures (as performance measures expressible via locally computable and globally combinable statistics such as sums of confusion-matrix elements) appears in Section 3.1. The derivation establishing exact equivalence to the centralized metric value is given in Theorem 1 of Section 3.2. We will revise the abstract to include a brief reference to this formal framework for improved verifiability. revision: partial

  2. Referee: [Abstract] The method is presented as addressing discrepancies for 'other metrics' beyond accuracy, yet the objection that non-additive metrics (AUC-ROC, average precision) depend on global ordering or thresholds rather than summable counts is not addressed; if FLAM relies only on summable local statistics, the central claim of equivalence for arbitrary metrics is at risk.

    Authors: FLAM is explicitly scoped to aggregatable measures rather than arbitrary metrics, as indicated by the title and the phrasing 'based on aggregatable measures' in the abstract. Section 2 explains that standard aggregation fails for many metrics because they are not defined in terms of summable local statistics, and Section 4 explicitly states that non-additive metrics such as AUC-ROC (which require global ranking or thresholds) cannot be recovered exactly from local aggregates alone and fall outside FLAM's scope. The phrase 'other metrics' refers to other aggregatable metrics (e.g., precision, recall, F1) that can be recovered exactly. We will add a short clarifying clause in the abstract to emphasize this scope limitation. revision: partial
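
The scope limitation is easy to demonstrate (our example, not the paper's): per-client AUC-ROC values can all be perfect while the pooled AUC is not, because AUC depends on the global ranking of scores across clients.

```python
# Demonstration (our example): averaging local AUC-ROC values does not
# recover the pooled AUC, because AUC depends on the global score ranking.
import numpy as np
from sklearn.metrics import roc_auc_score

# Two hypothetical clients whose score scales differ.
y1, s1 = np.array([0, 0, 1, 1]), np.array([0.1, 0.2, 0.3, 0.4])  # local AUC 1.0
y2, s2 = np.array([0, 0, 1, 1]), np.array([0.6, 0.7, 0.8, 0.9])  # local AUC 1.0

avg_local = (roc_auc_score(y1, s1) + roc_auc_score(y2, s2)) / 2
pooled = roc_auc_score(np.concatenate([y1, y2]), np.concatenate([s1, s2]))

print(avg_local)  # 1.0
print(pooled)     # 0.75: client 1's positives rank below client 2's negatives
```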

Circularity Check

0 steps flagged

FLAM is introduced as a definitional framework for aggregatable performance measures; no derivation reduces to fitted inputs or self-citations.

full rationale

The paper's core contribution is the proposal of FLAM, which redefines or selects performance measures such that they can be computed locally and aggregated (via summation or weighted sums) to exactly match centralized evaluation on pooled data. This equivalence is achieved by construction through the choice of aggregatable measures (e.g., those based on global counts like TP/FP for accuracy or precision). The abstract and description frame this as an examination of discrepancies followed by a new method, not a predictive derivation or theorem that loops back to its own assumptions. No load-bearing self-citations, uniqueness theorems from prior author work, or fitted parameters renamed as predictions are present in the provided text. The approach is self-contained as a definitional solution for the subclass of additive metrics, with any limitations on non-additive metrics (such as AUC) representing a scope restriction rather than circular reasoning. The derivation chain does not reduce any claimed result to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; the contribution is described as a methodological proposal for aggregatable measures.

pith-pipeline@v0.9.0 · 5465 in / 1068 out tokens · 45011 ms · 2026-05-11T02:33:11.206342+00:00 · methodology

discussion (0)

